Text Denoising Adapter in ASR
- Text Denoising Adapter (TDA) is a method for text-only domain adaptation in ASR that uses synthetic noise to simulate audio projector-induced distortions.
- It fine-tunes large language models by training them to recover clean transcripts from noisy inputs, thereby preserving cross-modal alignment without extra parameters.
- Empirical results demonstrate up to 22.1% relative WER improvement, offering a competitive alternative to both audio-supervised fine-tuning and previous text-only approaches.
The Text Denoising Adapter (TDA) is a lightweight method for text-only domain adaptation in automatic speech recognition (ASR) systems that use LLMs. TDA leverages synthetic text noise to emulate the distribution shift introduced by the speech projector in multimodal ASR. By training the LLM to recover clean transcripts from noisy text inputs that mimic audio-projected representations, TDA preserves cross-modal alignment and improves domain transfer, all without additional architectural changes or parameters (Burdisso et al., 28 Jan 2026).
1. Problem Formulation and Limitations of Prior Approaches
In LLM-based ASR architectures, a pretrained speech encoder feeds acoustic representations through a learnable speech projector into the LLM’s embedding space. The projector’s output, which can be regarded as “noisy text,” must be denoised by the LLM to yield accurate transcriptions. When only target-domain text is available, standard text-only fine-tuning of the LLM disrupts the projector-to-LLM alignment, leading to degraded ASR performance as the LLM forgets how to interpret the outputs of the speech projector.
Prior text-only adaptation methods attempted to preserve this alignment through perplexity-monitored early stopping (Fang et al.) or soft prompt embeddings (Ma et al.). Both add machinery, either a monitoring procedure or trainable modules, along with new hyperparameters that require careful calibration (Burdisso et al., 28 Jan 2026).
2. Text Denoising as Proxy for Audio-Induced Noise
TDA reframes text-only adaptation as a denoising problem, mimicking the corruption introduced in the audio-to-text mapping. Let $y$ be a clean transcript. The method defines a noisy-input generation function $\mathcal{N}$ so that $\tilde{y} = \mathcal{N}(y)$ simulates the type of corruption produced by the speech projector and discretization.
The adaptation objective maximizes the log-likelihood of recovering $y$ from $\tilde{y}$ under the LLM’s parameters $\theta$:
$$\max_{\theta} \; \log p_{\theta}(y \mid \tilde{y})$$
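Because the LLM decodes autoregressively, the sequence-level log-likelihood factorizes over output tokens by the chain rule. Writing $y = (y_1, \dots, y_T)$ for the clean transcript, $\tilde{y}$ for its noised version, and $\theta$ for the LLM parameters (symbol names chosen here for illustration, not taken from the source), the quantity being maximized can be expanded as:

```latex
\log p_{\theta}(y \mid \tilde{y})
  = \sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t}, \tilde{y}\right)
```

Maximizing this sum is equivalent to minimizing the usual token-level cross-entropy on the clean transcript, conditioned on the noisy input.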
The synthetic corruption consists of two operations: random character substitution, which selects 15% of words and replaces 30% of their characters, and random character duplication, in which each remaining character has a 10% chance of being repeated 1–3 times. This process reproduces the modality misalignment seen in multimodal ASR pipelines (Burdisso et al., 28 Jan 2026).
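The two corruption operations can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function name `add_synthetic_noise`, the lowercase replacement alphabet, and the use of independent per-word and per-character random draws are assumptions made here, and "repeated 1–3 times" is read as appending one to three extra copies of the character.

```python
import random

# Assumption: substituted characters are drawn from lowercase ASCII letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_synthetic_noise(text, word_rate=0.15, char_rate=0.30,
                        dup_rate=0.10, max_repeat=3, rng=None):
    """Corrupt a transcript with character substitution and duplication.

    word_rate:  chance that a word is selected for substitution
    char_rate:  chance that a character of a selected word is replaced
    dup_rate:   chance that a non-substituted character is duplicated
    max_repeat: maximum number of extra copies for a duplicated character
    """
    if rng is None:
        rng = random.Random()
    noisy_words = []
    for word in text.split():
        word_selected = rng.random() < word_rate
        chars = []
        for ch in word:
            if word_selected and rng.random() < char_rate:
                # Substitution: swap the character for a random letter.
                chars.append(rng.choice(ALPHABET))
            elif rng.random() < dup_rate:
                # Duplication: keep the character and add 1..max_repeat copies.
                chars.append(ch * (1 + rng.randint(1, max_repeat)))
            else:
                chars.append(ch)
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)


if __name__ == "__main__":
    clean = "please transfer funds to my savings account"
    print(add_synthetic_noise(clean, rng=random.Random(0)))
```

Each call yields a (noisy input, clean target) pair when combined with the original transcript. Passing a seeded `random.Random` makes the corruption reproducible.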
3. Alignment Preservation Through Synthetic Noise
The central analogy of TDA is to treat the synthetically noised transcript as a proxy for the output of the speech projector: in standard multimodal ASR, audio is encoded as embeddings that behave as noisy text-like sequences. The LLM, during standard ASR training, is conditioned to reconstruct clean transcripts from these noisy sequences. By fine-tuning the LLM on pairs of noised and clean transcripts, TDA ensures the LLM remains matched to the speech projector’s distribution, while simultaneously adapting to target-domain linguistic content. This preserves critical cross-modal alignment and avoids the catastrophic forgetting that otherwise occurs in naive adaptation (Burdisso et al., 28 Jan 2026).
4. Model Architecture and Training Dynamics
The TDA framework consists of a frozen speech encoder (e.g., WavLM-Large), a learnable speech projector (two linear layers with ReLU), and an LLM decoder (Llama 3B Instruct). The speech projector is trained exclusively on source audio-text pairs, while the LLM is adapted using mixed batches involving both real and synthetically noised data.
Batch Mixing Proportions
Each adaptation batch is a mixture of four example types:
- Source audio-text pairs
- Source audio passed through the speech projector and discretization, paired with its transcript
- Synthetic noise applied to source transcripts
- Synthetic noise applied to target-domain transcripts
One example type is assigned a fixed proportion of each batch, and the remainder is split equally among the other three.
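The mixing scheme, with one example type at a fixed share and the rest of the batch split equally among the other three, can be sketched as follows. The type names, the choice of which type is fixed, and the share of 0.4 used in the demo are placeholders chosen for illustration. They are not values from the source.

```python
import random

# Hypothetical labels for the four example types described above.
EXAMPLE_TYPES = (
    "source_audio",           # real source audio with its transcript
    "source_projected_text",  # projector output after discretization
    "source_noised_text",     # synthetic noise on source transcripts
    "target_noised_text",     # synthetic noise on target-domain transcripts
)


def mixing_weights(fixed_type, fixed_share):
    """Give one type a fixed share and split the remainder equally."""
    if fixed_type not in EXAMPLE_TYPES:
        raise ValueError(f"unknown example type: {fixed_type}")
    if not 0.0 <= fixed_share <= 1.0:
        raise ValueError("fixed_share must lie in [0, 1]")
    rest = (1.0 - fixed_share) / (len(EXAMPLE_TYPES) - 1)
    return {t: (fixed_share if t == fixed_type else rest)
            for t in EXAMPLE_TYPES}


def sample_batch_types(weights, batch_size, rng=None):
    """Draw the example type for each slot of one batch."""
    if rng is None:
        rng = random.Random()
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types],
                       k=batch_size)


if __name__ == "__main__":
    weights = mixing_weights("target_noised_text", 0.4)  # placeholder share
    print(weights)
    print(sample_batch_types(weights, batch_size=8, rng=random.Random(0)))
```

Sampling a type per slot keeps the expected composition at the chosen proportions. A deterministic allocation, with fixed counts per batch, would be an equally valid reading of the description.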
Hyperparameters
- Projector training: governed by a learning rate, a warmup-step count, a batch size, and a fixed number of epochs.
- TDA fine-tuning: governed by its own learning rate, warmup-step count, and batch size, with the number of epochs varied by domain.
- Synthetic noise uses nlpaug defaults except for a reduced word-edit rate (15% of words, as described in Section 2) (Burdisso et al., 28 Jan 2026).
No new modules or parameters are introduced relative to the pretrained ASR model, in contrast to soft-prompt methods.
5. Empirical Evaluation and Results
Benchmarked Datasets
- DefinedAI: 125h of conversational audio, split between source domains (Banking, Insurance, Healthcare) and target-only text domains (Banking or Insurance).
- SlideSpeech: Audio-visual YouTube dataset; source domains (Life, Talent, English), target-only text in Agriculture, Animation, and Musical Instruments.
Baselines
- Base Model: Projector trained, LLM frozen.
- Audio adaptation: LLM fine-tuned on target-domain audio/text using LoRA on self-attention layers.
- Recent text-only methods: perplexity-monitored early stopping (Fang et al.), soft-prompt embeddings (Ma et al.).
Results
| Domain | Base WER | Audio adaptation | Fang et al. | Ma et al. | TDA |
|---|---|---|---|---|---|
| DefinedAI → Banking | 12.98% | 9.92% (-23.6%) | 10.92% (-15.9%) | 10.63% (-18.1%) | 10.11% (-22.1%) |
| DefinedAI → Insurance | - | - | - | - | -17.9% (relative WER change) |
| SlideSpeech (out-of-domain) | - | - | - | - | 4–8% rel. gain |
| SlideSpeech (cross-domain) | - | - | - | - | 10–15% rel. gain |
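Relative improvement is defined as $\Delta_{\text{rel}} = \frac{\mathrm{WER}_{\text{adapted}} - \mathrm{WER}_{\text{base}}}{\mathrm{WER}_{\text{base}}} \times 100\%$, so negative values indicate a reduction in WER with respect to the base model.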
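The parenthesized percentages in the DefinedAI → Banking row are consistent with the relative change (adapted WER minus base WER, divided by base WER). A minimal sketch that recomputes them from the absolute WERs in the table follows. The function name `relative_wer_change` is chosen here for illustration.

```python
def relative_wer_change(base_wer, adapted_wer):
    """Relative WER change in percent; negative means WER went down."""
    if base_wer <= 0:
        raise ValueError("base_wer must be positive")
    return (adapted_wer - base_wer) / base_wer * 100.0


if __name__ == "__main__":
    base = 12.98  # Base WER, DefinedAI -> Banking
    systems = {
        "Audio adaptation": 9.92,
        "Fang et al.": 10.92,
        "Ma et al.": 10.63,
        "TDA": 10.11,
    }
    for name, wer in systems.items():
        print(f"{name}: {relative_wer_change(base, wer):.1f}%")
    # Audio adaptation: -23.6%
    # Fang et al.: -15.9%
    # Ma et al.: -18.1%
    # TDA: -22.1%
```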
TDA demonstrates up to 22.1% relative improvement in WER on DefinedAI → Banking, approaching the 23.6% obtained with audio-supervised fine-tuning and outperforming both competing text-only methods (15.9% and 18.1%).
6. Ablation Studies and Analysis
Batch composition ablations reveal that omitting the real audio component leads to catastrophic forgetting (WER around 73%). Models using all three source example types achieve the best results. Substituting or omitting the synthetic noise process for target text reduces gains: an echoing prompt yields -20.6% versus -22.1% for TDA, an empty prompt -18.6%, and no prompt -18.9%, underscoring the value of explicit denoising.
7. Summary of TDA Training Workflow
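The training workflow can be summarized in three stages:
- Train the speech projector on source audio-text pairs, keeping the speech encoder frozen.
- Generate noisy inputs by applying synthetic noise to source and target-domain transcripts.
- Fine-tune the LLM on mixed batches that combine source audio examples with the noised-text examples, training it to recover the clean transcript in every case.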
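As a rough illustration of how the stages described in Section 4 fit together, the sketch below wires them into a small driver. It is a skeleton under stated assumptions, not the authors' code: `DenoisingExample`, `build_denoising_examples`, and `tda_workflow` are hypothetical names, and the projector training and LLM fine-tuning steps are passed in as caller-supplied callbacks rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DenoisingExample:
    noisy_input: str   # text the LLM is conditioned on
    clean_target: str  # transcript the LLM is trained to produce
    domain: str        # "source" or "target"


def build_denoising_examples(transcripts: List[str], domain: str,
                             noise_fn: Callable[[str], str]
                             ) -> List[DenoisingExample]:
    """Pair every clean transcript with a synthetically noised copy."""
    return [DenoisingExample(noise_fn(t), t, domain) for t in transcripts]


def tda_workflow(source_transcripts: List[str],
                 target_transcripts: List[str],
                 noise_fn: Callable[[str], str],
                 train_projector: Callable[[], None],
                 fine_tune_llm: Callable[[List[DenoisingExample]], None]
                 ) -> List[DenoisingExample]:
    # Stage 1: projector training on source audio-text pairs (callback).
    train_projector()
    # Stage 2: noised-text examples from source and target transcripts.
    examples = (build_denoising_examples(source_transcripts, "source", noise_fn)
                + build_denoising_examples(target_transcripts, "target", noise_fn))
    # Stage 3: LLM adaptation; the callback is expected to mix these text
    # examples with source audio examples when forming batches.
    fine_tune_llm(examples)
    return examples
```

Keeping the noise function and both training steps as parameters leaves the skeleton independent of any particular framework or noise library.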
This approach ensures robust cross-modal alignment while imparting target-domain specialization, achieving strong gains without model expansion or additional hyperparameters (Burdisso et al., 28 Jan 2026).