REDO: Repetitive Doubling in ASR Attacks
- REDO is a multi-stage adversarial objective designed to double output sequences in ASR models, boosting both word error rate and transcript length.
- It employs a two-stage projected gradient descent process that first degrades accuracy and then enforces repetition using EOS suppression and a curriculum-based REDO loss.
- Empirical results on LibriSpeech and LJ-Speech show that REDO, as part of the MORE framework, achieves 8–14× transcript elongation while maintaining high adversarial impact.
The Repetitive Encouragement Doubling Objective (REDO) is a multi-stage objective introduced to amplify adversarial attack efficiency in large-scale automatic speech recognition (ASR) models with autoregressive decoders, such as Whisper. REDO specifically targets both prediction accuracy and inference efficiency by periodically doubling the decoded sequence target, leading to stable repetitive loops and substantial computational cost amplification. REDO operates as a core component within the MORE (Multi-Objective Repetitive Doubling Encouragement) attack framework, optimizing for both high word error rate (WER) and maximal transcript elongation in a hierarchical projected gradient descent (PGD) loop (Gao et al., 5 Jan 2026).
1. Mathematical Formulation and Optimization Framework
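REDO is embedded in a two-stage PGD adversarial optimization, where the perturbation $\delta$ to the input audio $X$ is sought within the budget $\|\delta\| \le \epsilon$ so as to jointly maximize the WER of $f(X+\delta)$ against $Y$ and the length of the decoded transcript, where $f$ is the ASR model and $Y$ is the ground-truth transcript. The PGD process of $K$ steps is divided into:
- Repulsion Stage ($i = 1, \dots, K_a$): Maximizes WER via the loss $L_{acc} = -\mathrm{CE}(f(X+\delta), Y)$, pushing the model away from $Y$.
- Anchoring Stage ($i = K_a+1, \dots, K$): Maximizes output sequence length and induces repetition through two synergistic losses:
  - EOS Suppression: $L_{EOS} = P(\mathrm{EOS} \mid X+\delta, t) - \max_{v \ne \mathrm{EOS}} P(v \mid X+\delta, t)$, evaluated at position $t = 2|B|$, discouraging end-of-sequence generation.
  - REDO Loss: For each curriculum block of $D$ PGD steps, construct a doubled target $Y' = [B \| B]$, where $B$ is the prefix of the current greedy decode $\hat{Y}$ (without EOS), and optimize $L_{REDO} = \mathrm{CE}(f(X+\delta), Y')$.
  - Joint Efficiency Loss: $L_{eff} = L_{EOS} + L_{REDO}$.
The attack thus optimizes $L_{acc}$ first (repulsion) and then $L_{eff}$ (anchoring). The REDO target $Y'$ is held fixed within each block to ensure stable optimization.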
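The anchoring-stage quantities can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`doubled_target`, `eos_margin`, `cross_entropy`) are hypothetical, the model outputs are stood in for by plain probability arrays, and averaging the cross-entropy over positions is an assumption.

```python
import numpy as np

def doubled_target(greedy_tokens, eos_id):
    """Build Y' = [B || B], where B is the greedy decode without a trailing EOS."""
    tokens = list(greedy_tokens)
    if tokens and tokens[-1] == eos_id:
        tokens = tokens[:-1]
    return tokens + tokens

def eos_margin(probs, eos_id):
    """EOS margin at one decoding position: P(EOS) minus the best non-EOS probability."""
    probs = np.asarray(probs, dtype=float)
    non_eos = np.delete(probs, eos_id)
    return float(probs[eos_id] - non_eos.max())

def cross_entropy(step_probs, target):
    """Mean negative log-likelihood of `target` under per-position distributions.

    step_probs has shape (T, V); target holds T token ids. (Mean reduction is assumed.)
    """
    step_probs = np.asarray(step_probs, dtype=float)
    picked = step_probs[np.arange(len(target)), target]
    return float(-np.log(picked).mean())

# Toy usage: vocabulary of 4 tokens, EOS has id 0.
y_prime = doubled_target([2, 3, 1, 0], eos_id=0)        # B = [2, 3, 1]
uniform = np.full((len(y_prime), 4), 0.25)              # stand-in for f(X + delta)
l_redo = cross_entropy(uniform, y_prime)
l_eos = eos_margin([0.1, 0.6, 0.2, 0.1], eos_id=0)
l_eff = l_eos + l_redo
```

Minimizing `l_eff` lowers the EOS margin (making EOS less competitive) while pulling the decoder toward the doubled sequence.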
2. Algorithmic Workflow and Pseudocode Structure
REDO is tightly integrated within the MORE adversarial generation loop, as outlined in Algorithm 1 of (Gao et al., 5 Jan 2026). The process can be summarized as follows:
1. Input: Audio X, ground truth Y, ε, α, K, Ka, D, model f
2. Initialize δ ← 0
3. // Repulsion stage (i = 1..Ka)
   For i = 1 to Ka:
       Compute L_acc = -CE(f(X+δ), Y)
       δ ← Clip_{\|·\|≤ε}(δ − α·sign(∇_δ L_acc))
4. // Anchoring stage (i = Ka+1..K)
   For i = Ka+1 to K:
       s ← i - Ka
       If (s-1) mod D == 0:
           \hat Y ← GreedyDecode(f, X+δ)
           B ← \hat Y_{1:|\hat Y|-1} // Remove EOS
           Y' ← [B‖B] // REDO target
       L_EOS ← P(EOS|X+δ,|B|*2) - max_{v≠EOS}P(v|X+δ,|B|*2)
       L_REDO ← CE(f(X+δ), Y')
       L_eff ← L_EOS + L_REDO
       δ ← Clip_{\|·\|≤ε}(δ − α·sign(∇_δ L_eff))
5. Return δ
The curriculum “doubled target” is refreshed once every $D$ iterations, creating a repetitive anchor and allowing the model to enter a repetition loop.
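The step bookkeeping of the loop (which stage a step belongs to, and when the doubled target is re-decoded) can be checked in isolation. The sketch below mirrors only the index arithmetic of the pseudocode; the function name `more_schedule` is hypothetical and no model or gradient is involved.

```python
def more_schedule(K, Ka, D):
    """Return a list of (step, stage, refresh) tuples for PGD steps i = 1..K.

    Steps 1..Ka are the repulsion stage; steps Ka+1..K are the anchoring stage.
    `refresh` is True when the doubled target is rebuilt, i.e. (s - 1) mod D == 0
    with s = i - Ka, matching the pseudocode.
    """
    plan = []
    for i in range(1, K + 1):
        if i <= Ka:
            plan.append((i, "repulsion", False))
        else:
            s = i - Ka
            plan.append((i, "anchoring", (s - 1) % D == 0))
    return plan

# Toy usage: 7 steps in total, 2 repulsion steps, target refreshed every 3 anchoring steps.
plan = more_schedule(K=7, Ka=2, D=3)
refresh_steps = [i for i, _, refresh in plan if refresh]
```

With these toy values the target is rebuilt at steps 3 and 6, so the first anchoring step always starts from a fresh greedy decode.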
3. Theoretical Motivation and Properties
REDO exploits the autoregressive repetition bias in transformer decoders, which tend to form self-reinforcing loops when a token span re-emerges in context. By constructing the doubled target $Y' = [B \| B]$ and optimizing the perturbation so that the model decodes this repeated sequence, REDO amplifies the likelihood of stable repetition, a tendency documented in [Xu et al., 2022]. This mechanism differs fundamentally from random token repetition and leverages context-driven attention dynamics.
From a computational perspective, the sequence length after $n$ doubling blocks grows exponentially ($L_n \approx 2^n L_0$, with $L_0$ the initial decode length). Decoder self-attention complexity is $O(L^2)$ in the sequence length $L$, so the per-example attention FLOPs can increase as $O(4^n)$. Only a few doubling cycles can thus escalate compute requirements dramatically before beam/length constraints are encountered. Appendix B of (Gao et al., 5 Jan 2026) details comprehensive FLOPs analyses demonstrating this effect.
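To make the growth concrete, the following sketch tabulates an idealized case in which every curriculum block fully doubles the decoded length, with an optional length cap. The function name `doubling_growth`, the use of length squared as a proxy for self-attention cost, and the cap value are illustrative assumptions, not figures from the paper.

```python
def doubling_growth(initial_len, num_blocks, max_len=None):
    """Return (block, length, relative_attention_cost) rows for idealized doubling.

    Length doubles each block and is clipped to `max_len` if given. The cost
    column is length**2, a rough proxy for quadratic self-attention work.
    """
    rows = []
    length = initial_len
    for n in range(num_blocks + 1):
        capped = length if max_len is None else min(length, max_len)
        rows.append((n, capped, capped ** 2))
        length *= 2
    return rows

# Toy usage: start from 30 tokens, apply 3 doubling blocks, no cap.
uncapped = doubling_growth(30, 3)
# Same start with a hypothetical decoder cap of 448 tokens.
capped = doubling_growth(30, 5, max_len=448)
```

In the uncapped run the cost proxy grows 64-fold after three blocks (4 to the power 3), while in the capped run the length saturates once doubling would exceed the cap.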
4. Empirical Results and Ablation Analyses
Experiments were conducted on LibriSpeech and LJ-Speech, with Whisper (all model sizes) as the victim network and perturbation budgets corresponding to SNR = 35 dB and SNR = 30 dB. Key metrics are word error rate (WER, %) and decoded token length. The main findings include:
- Baseline Comparison on Whisper-base (LibriSpeech):
  - PGD: WER=88.73%, length=31.65
  - SlothSpeech: WER=54.63%, length=156.07
  - MORE (with REDO): WER=88.42%, length=300.13
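The elongation factors implied by these three rows follow from simple division. The snippet below only restates the reported Whisper-base (LibriSpeech) numbers; the dictionary layout and the helper `length_ratio` are illustrative.

```python
# (WER %, mean decoded token length) as reported for Whisper-base on LibriSpeech.
reported = {
    "PGD": (88.73, 31.65),
    "SlothSpeech": (54.63, 156.07),
    "MORE": (88.42, 300.13),
}

def length_ratio(attack, baseline):
    """Ratio of mean decoded lengths between two attacks."""
    return reported[attack][1] / reported[baseline][1]

more_vs_pgd = round(length_ratio("MORE", "PGD"), 2)            # about 9.48
more_vs_sloth = round(length_ratio("MORE", "SlothSpeech"), 2)  # about 1.92
wer_gap_vs_pgd = round(reported["PGD"][0] - reported["MORE"][0], 2)
```

For this single configuration MORE is roughly 9.5× longer than PGD at nearly the same WER, and just under 2× longer than SlothSpeech; the ranges quoted elsewhere in this section are taken across model sizes and datasets.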
Across all model sizes and datasets:
- MORE achieves comparable or higher WER than state-of-the-art accuracy-focused attacks (≈90%)
- Output length produced by MORE with REDO is 8–14× longer than accuracy-only baselines
- Transcript length under MORE outperforms the SlothSpeech attack by 2–3× while maintaining WER ≥50%
Ablation results (LJ-Speech, SNR=35 dB) highlight:
- Removing $L_{acc}$ reduces WER to 27.6% but still yields 293 tokens (REDO sustains length)
- Removing $L_{REDO}$ reduces length from 296 to 120 tokens
- Removing all efficiency losses yields 30 tokens, WER ≈ 94% (accuracy-only baseline)
The combination of all three losses ($L_{acc}$, $L_{EOS}$, and $L_{REDO}$) achieves the most robust joint attack on accuracy and efficiency.
5. Implementation Considerations and Practical Guidelines
Critical implementation factors include:
- Doubling period $D$: Governs how often the doubled target is refreshed; the block length used in the reported experiments provides stable optimization and rapid transcript elongation.
- Accuracy-only steps $K_a$: Sufficient steps (e.g., 50) are needed to ensure high WER; too many constrain the attack’s ability to induce repetition in the subsequent anchoring stage.
- PGD Step Size $\alpha$: Should maintain imperceptibility within the $\epsilon$-ball; the per-step value used in the reported experiments is effective.
- EOS Suppression: Although not strictly necessary for repetition, removing $L_{EOS}$ reduces transcript length by 20%. EOS suppression is synergistic with REDO.
- Efficiency on Other Decoders: REDO can be adapted to any autoregressive decoder with repetition loop tendencies. For decoders with length penalties or beam search, the curriculum block size and suppression routines may need adjustment.
- Length Caps: Practical sequence length constraints may limit attainable compute amplification.
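The update and budget handling referenced above can be sketched as follows. Two assumptions are made here and are not confirmed by the source: the constraint is treated as an L-infinity ball (suggested by the sign-gradient update), and SNR is computed with the standard 20·log10 ratio of L2 norms. The names `pgd_sign_step` and `snr_db` are hypothetical.

```python
import numpy as np

def pgd_sign_step(delta, grad, alpha, eps):
    """One signed-gradient descent step, then projection onto the L-inf ball of radius eps.

    Assumes an L-infinity budget; the source only writes Clip_{||.|| <= eps}.
    """
    stepped = delta - alpha * np.sign(grad)
    return np.clip(stepped, -eps, eps)

def snr_db(signal, perturbation):
    """Signal-to-noise ratio in dB, using the standard 20*log10(||x||_2 / ||delta||_2)."""
    signal = np.asarray(signal, dtype=float)
    perturbation = np.asarray(perturbation, dtype=float)
    return float(20.0 * np.log10(np.linalg.norm(signal) / np.linalg.norm(perturbation)))

# Toy usage: a step larger than the budget is clipped back to +/- eps.
delta = pgd_sign_step(np.zeros(3), np.array([1.0, -2.0, 0.0]), alpha=0.5, eps=0.3)
toy_snr = snr_db(np.ones(4), np.full(4, 0.1))  # amplitude ratio of 10
```

Choosing `alpha` well below `eps` lets the perturbation accumulate over many steps instead of saturating at the boundary after the first update.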
6. Context, Extensions, and Related Work
REDO represents a novel direction in adversarial attack methodology emphasizing not only accuracy degradation (maximizing WER) but also resource-intensive output manipulation through exponential transcript elongation. This approach stands apart from prior attacks, such as PGD and SlothSpeech, by jointly optimizing both objectives rather than targeting accuracy or efficiency alone.
The application of REDO to other ASR systems or generative transformers requires ensuring the presence of repetition loop dynamics, a suitable initial accuracy-degradation stage, and potential adaptation for decoders employing beam search or length normalization. The role of structured repetition, as opposed to random or unstructured extension, is central to the attack’s stable long-sequence induction.
7. Summary and Significance
The Repetitive Encouragement Doubling Objective (REDO), as operationalized within MORE, enables targeted attacks on transformer-based ASR models that force joint degradation of transcription correctness and massive inflation of inference cost via transcript elongation. By leveraging curriculum-based repetition incentives and periodic anchor construction, REDO induces model behaviors that are costly both in accuracy and computational resources. Empirical results substantiate its superiority over prior baselines for both attack criteria (Gao et al., 5 Jan 2026). This suggests that efficiency-oriented adversarial objectives represent a significant, underexplored axis in robustness analysis for modern ASR systems.