Insertion and Deletion of Interruption Token (IDIT)

Updated 5 January 2026
  • IDIT is a supervised technique that aligns LLM subword outputs with the token granularity required for code-switched ASR.
  • It segments mixed-language inputs into distinct units, ensuring each Chinese character and English word is represented by exactly one token.
  • IDIT integrates into a MoE-enhanced speech-to-text pipeline, improving accuracy and stability without altering the core model architecture.

The Insertion and Deletion of Interruption Token (IDIT) mechanism is a supervised label construction technique designed to address the misalignment between subword tokenization in LLMs and the token-level requirements of code-switching automatic speech recognition (ASR). IDIT enables effective transfer of LLM text generation capabilities into ASR for mixed-language (e.g., Mandarin/English) transcripts by enforcing explicit character- and word-level granularity at the label construction stage. It is introduced as a central component within a speech-conditioned LLM pipeline enhanced by a Mixture of Experts (MoE) connector (Zhang et al., 2024).

1. Formal Definition and Mathematical Formulation

Let $X$ be a ground-truth code-switched string composed of a sequence of Mandarin Chinese characters and English words. The IDIT method introduces a dedicated “interruption” token $I$ (vocabulary index $i$) and defines the following operators:

  • $\text{InsertI}(\cdot)$: injects $I$ after every Chinese character and English word in $X$.
  • $\text{Tokenizer}(\cdot)$: the LLM’s pretrained subword tokenizer (e.g., a 151,643-token BPE vocabulary).
  • $\text{RemoveI}(\cdot)$: eliminates all $I$ tokens from a tokenized sequence.

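Define $X' = \text{InsertI}(X)$. The final sequence of training labels is constructed by tokenizing $X'$ and removing all occurrences of $I$:

$$Y = \text{RemoveI}(\text{Tokenizer}(X'))$$

The training objective is standard autoregressive next-token cross-entropy:

$$\mathcal{L} = -\sum_{t=1}^{|Y|} \log p_\theta(y_t \mid y_{<t}, S)$$

where $y_t$ is the $t$-th label in $Y$, $S$ is the input speech, and $\theta$ denotes the trainable parameters. No auxiliary regularization is added; the sole innovation is the construction of $Y$ so that the model outputs exactly one token per Chinese character or English word.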
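As a minimal sketch of the objective described above, the snippet below sums the negative log-likelihood of integer labels given per-step logits. The function names and toy values are illustrative assumptions, not taken from the source; a real system would rely on a deep learning framework's built-in loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_cross_entropy(step_logits, labels):
    """Summed negative log-likelihood of the labels.

    step_logits[t] holds the model's scores over the vocabulary at step t
    (assumed already conditioned on the speech input and earlier labels);
    labels[t] is the integer label for that step.
    """
    if len(step_logits) != len(labels):
        raise ValueError("one logit vector is required per label")
    return sum(-log_softmax(logits)[y]
               for logits, y in zip(step_logits, labels))

# Toy example (hypothetical values): vocabulary of 3 tokens, 2 label positions.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
toy_labels = [0, 2]
loss = next_token_cross_entropy(toy_logits, toy_labels)
```

Because IDIT changes only the contents of `labels`, the loss code itself is the same with or without IDIT.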

2. Algorithmic Construction and Implementation

The IDIT label construction algorithm segments the code-switched string into single Chinese characters and complete English words, appends the interruption token $I$ after each unit, tokenizes the resulting sequence, and then removes all interruption tokens. This procedure is applied to each training example. The model itself never produces interruption tokens, during training or inference; IDIT solely modifies the target sequence. The model architecture (speech encoder, MoE connector, and LLM with LoRA adapters) remains unchanged, and gradients propagate normally through all trainable parameters, since IDIT only affects the integer label targets in the loss function.
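The label construction procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the token string `<|I|>`, the regular expression used for segmentation, the choice to drop whitespace and punctuation, and the `ToyTokenizer` stand-in are all hypothetical and do not come from the source. In practice `encode` would be the LLM's own tokenizer with the interruption token registered as a special token.

```python
import re

INTERRUPT = "<|I|>"  # hypothetical surface form of the interruption token

# One unit = a single CJK character, or a run of Latin letters/digits/apostrophes.
UNIT_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")

def segment(text):
    """Split a code-switched string into Chinese characters and English words."""
    return UNIT_RE.findall(text)

def insert_i(text):
    """InsertI: append the interruption token after every unit."""
    return "".join(unit + INTERRUPT for unit in segment(text))

def remove_i(token_ids, interrupt_id):
    """RemoveI: drop every occurrence of the interruption token ID."""
    return [t for t in token_ids if t != interrupt_id]

def build_labels(text, encode, interrupt_id):
    """IDIT labels: RemoveI(Tokenizer(InsertI(text)))."""
    return remove_i(encode(insert_i(text)), interrupt_id)

class ToyTokenizer:
    """Stand-in tokenizer (assumption): treats INTERRUPT as one special token
    and assigns one ID to each piece between interruption tokens."""

    def __init__(self):
        self.vocab = {INTERRUPT: 0}

    def encode(self, text):
        pieces = re.split(f"({re.escape(INTERRUPT)})", text)
        return [self.vocab.setdefault(p, len(self.vocab)) for p in pieces if p]

tok = ToyTokenizer()
labels = build_labels("我想 book 机票", tok.encode, tok.vocab[INTERRUPT])
```

Only the integer targets change; nothing in this sketch touches the model, which matches the description of IDIT as a pure label construction step.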

3. Integration with Mixture-of-Experts Connector and Training Regimen

The speech-conditioned LLM framework that uses IDIT features a Mixture-of-Experts connector with two expert feedforward networks, one for Mandarin and one for English. Frame-wise expert probabilities are computed by a weight-shared router,

$$p_t = \text{softmax}(W_r h_t + b_r)$$

where $h_t$ are $d$-dimensional speech frame embeddings, and $W_r$ and $b_r$ are the router projection weight and bias.
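A frame-wise softmax router of this kind can be sketched in plain Python. The linear-projection form, the function names, the toy weights, and the probability-weighted mixing of expert outputs are assumptions made for illustration; the source does not spell out how the connector combines the two experts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_frames(frames, weight, bias):
    """Return per-frame probabilities over the experts.

    frames: list of d-dimensional frame embeddings.
    weight: one d-dimensional row per expert; bias: one scalar per expert.
    In this sketch the same weight and bias are applied to every frame.
    """
    probs = []
    for h in frames:
        logits = [sum(w * x for w, x in zip(row, h)) + b
                  for row, b in zip(weight, bias)]
        probs.append(softmax(logits))
    return probs

def mix_experts(frame, frame_probs, experts):
    """Weight each expert's output by its router probability (assumed mixing rule)."""
    outputs = [expert(frame) for expert in experts]
    return [sum(p * out[k] for p, out in zip(frame_probs, outputs))
            for k in range(len(outputs[0]))]

# Toy data (hypothetical): two 2-dimensional frames, two experts.
frames = [[1.0, 0.0], [0.0, 1.0]]
weight = [[2.0, 0.0], [0.0, 2.0]]
bias = [0.0, 0.0]
probs = route_frames(frames, weight, bias)

experts = [lambda h: [2.0 * x for x in h], lambda h: [-x for x in h]]
mixed = mix_experts(frames[0], probs[0], experts)
```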

The two-stage progressive training schedule is as follows:

| Stage | LLM/LoRA | IDIT Used | Expert Routing | Label Granularity | Modules Trained |
|---|---|---|---|---|---|
| 1: Alignment | Frozen | No | Deterministic (language class) | Native subword tokenization | MoE connector only |
| 2: Fine-tune | Unfrozen | Yes | Softmax (equations above) | Char-level (zh), word-level (en) | MoE connector + LoRA adapters |

This curriculum allows the connector to initially learn a robust cross-modal mapping before the LLM is adapted to the enforced label granularity.
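The two-stage schedule can be expressed as a small configuration sketch. The class, the module group names, and the one-hot reading of "deterministic (language class)" routing are hypothetical choices made for illustration; only the per-stage settings themselves come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    use_idit: bool
    routing: str        # "deterministic" or "softmax"
    trainable: tuple    # module groups that receive gradient updates

STAGES = (
    StageConfig("alignment", use_idit=False, routing="deterministic",
                trainable=("moe_connector",)),
    StageConfig("fine_tune", use_idit=True, routing="softmax",
                trainable=("moe_connector", "lora_adapters")),
)

# Hypothetical names for the parameter groups of the pipeline.
MODULE_GROUPS = ("speech_encoder", "moe_connector", "llm_base", "lora_adapters")

def requires_grad_map(stage):
    """Map each module group to whether it is updated in this stage."""
    return {group: group in stage.trainable for group in MODULE_GROUPS}

def expert_weights(stage, language_id, router_probs):
    """Stage 1: one-hot weights from the language label (assumed reading).
    Stage 2: the router's softmax probabilities."""
    if stage.routing == "deterministic":
        return [1.0 if k == language_id else 0.0
                for k in range(len(router_probs))]
    return list(router_probs)
```

Keeping IDIT and LoRA switched off in the first entry mirrors the curriculum: the connector is trained against native subword labels before the label granularity changes.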

4. Hyperparameters and Training Protocol

The IDIT-enhanced pipeline is reported with the following settings:

  • LLM tokenizer vocabulary size: 151,643 (Qwen2-7B).
  • Interruption token ID: a single additional vocab item.
  • Number of experts: 2 (one for Mandarin, one for English).
  • Expert FFN hidden dimension: 2048.
  • Router projection: ii4, ii5.
  • Downsampling (frame splicing): factor 5.
  • LoRA adapter hyperparameters: rank ii6, ii7.
  • Effective batch size: 18 (batch size 6, grad-accumulation 3).
  • Optimizer: AdamW, ii8, learning rate ii9, no weight decay.
  • Learning rate warmup: 1000 steps.
  • Hardware: 8 × NVIDIA A800 GPUs; each training stage runs for approximately 25k steps.

The staged curriculum, specifically the omission of IDIT and LoRA adaptation in stage 1, is described as critical for stable convergence and effective representation learning.
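The settings with recoverable values can be collected in a configuration sketch. The class and field names are hypothetical. Values not available here (LoRA rank and scaling, AdamW betas, peak learning rate) are deliberately left out or passed in as parameters. The linear warmup shape and the handling of leftover frames during splicing are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    vocab_size: int = 151_643
    num_experts: int = 2
    expert_hidden_dim: int = 2048
    frame_splice_factor: int = 5
    batch_size: int = 6
    grad_accum_steps: int = 3
    warmup_steps: int = 1000
    steps_per_stage: int = 25_000   # approximate
    weight_decay: float = 0.0

    @property
    def effective_batch_size(self):
        # 6 examples per step, accumulated over 3 steps
        return self.batch_size * self.grad_accum_steps

def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr; the linear shape is an assumption."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * (step + 1) / warmup_steps

def splice_frames(frames, factor):
    """Downsample by concatenating every `factor` consecutive frames.
    Leftover frames at the end are dropped (an assumption)."""
    usable = len(frames) - len(frames) % factor
    return [sum(frames[i:i + factor], []) for i in range(0, usable, factor)]

cfg = TrainConfig()
```

With the listed values, `cfg.effective_batch_size` evaluates to 6 × 3 = 18, matching the stated effective batch size.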

5. Illustrative Example of the IDIT Mechanism

For a code-switched segment $X$:

  1. Segmentation: $X$ is split into its individual Chinese characters and English words.
  2. Insertion: $I$ is appended after each unit, giving $X' = \text{InsertI}(X)$.
  3. Tokenization (BPE or similar): $\text{Tokenizer}(X')$ produces a token sequence in which every unit is followed by $I$.
  4. Removal of $I$ tokens: $\text{RemoveI}(\text{Tokenizer}(X'))$ gives the final label sequence.

The final label sequence yields exactly one token per Chinese character and one per English word, tightly controlling output granularity and aligning token emission with spoken segments.

6. Effect and Significance in ASR for Code-switching

The IDIT mechanism eliminates spurious insertions and deletions that typically arise when LLM subword tokenization is misaligned with spoken unit boundaries, particularly in code-switched utterances. It achieves character-level resolution for Chinese and word-level for English, precisely matching linguistic intuition for mixed transcripts. Importantly, IDIT operates entirely outside the model forward pass, yielding a fully differentiable pipeline and avoiding complexity in model architecture or loss computation.

Within the MoE-enhanced speech-LM pipeline, IDIT enables the model to leverage pretrained LLM text generation capabilities in an ASR context without requiring structural changes to the model or introduction of extra loss terms. The approach is thus both lightweight and effective for code-switched speech recognition (Zhang et al., 2024).
