
REDO: Repetitive Doubling in ASR Attacks

Updated 6 January 2026
  • REDO is a multi-stage adversarial objective designed to double output sequences in ASR models, boosting both word error rate and transcript length.
  • It employs a two-stage projected gradient descent process that alternates between degrading accuracy and enforcing repetition using EOS suppression and a curriculum-based REDO loss.
  • Empirical results on LibriSpeech and LJ-Speech show that REDO, as part of the MORE framework, achieves 8–14× transcript elongation while maintaining high adversarial impact.

The Repetitive Encouragement Doubling Objective (REDO) is a multi-stage adversarial objective that targets both the prediction accuracy and the inference efficiency of large-scale automatic speech recognition (ASR) models with autoregressive decoders, such as Whisper. By periodically doubling the decoded sequence target, REDO drives the model into stable repetitive loops, substantially amplifying computational cost. REDO operates as a core component within the MORE (Multi-Objective Repetitive Doubling Encouragement) attack framework, which optimizes for both high word error rate (WER) and maximal transcript elongation in a hierarchical projected gradient descent (PGD) loop (Gao et al., 5 Jan 2026).

1. Mathematical Formulation and Optimization Framework

REDO is embedded in a two-stage PGD adversarial optimization, where the perturbation $\delta$ to input audio $X$ is solved under the multi-objective:

$$\max_{\delta \in \Delta_\epsilon} \bigl( \mathrm{WER}(f(X+\delta), Y),\; |f(X+\delta)| \bigr)$$

where $f$ is the ASR model and $Y$ is the ground-truth transcript. The PGD process is divided into:

  • Repulsion Stage ($i = 1, \dotsc, K_a$): Maximizes WER via the loss $L_\text{acc}(\delta) = -\mathrm{CE}(f(X+\delta), Y)$, pushing the model away from $Y$.
  • Anchoring Stage ($i = K_a+1, \dotsc, K$): Maximizes output sequence length and induces repetition through two synergistic losses:
    • EOS Suppression: $L_\text{EOS}(\delta) = P(\text{EOS} \mid X+\delta, L) - P(z \mid X+\delta, L)$, where $L$ is the current target position and $z = \arg\max_{v \neq \text{EOS}} P(v \mid X+\delta, L)$ is the most likely non-EOS token at that position; minimizing this loss discourages end-of-sequence generation.
    • REDO Loss: For each curriculum block of $D$ PGD steps, construct a doubled target $Y^{(m)} = [B_m \,\Vert\, B_m]$, where $B_m$ is the prefix of the current greedy decode (without EOS), and optimize $L_\mathrm{REDO}^{(m)}(\delta) = \mathrm{CE}(f(X+\delta), Y^{(m)})$.
  • Joint Efficiency Loss: $L_\text{eff}(\delta) = L_\text{EOS}(\delta) + L_\mathrm{REDO}^{(m)}(\delta)$.

The attack thus alternates between $\nabla_\delta L_\text{acc}$ (repulsion) and $\nabla_\delta L_\text{eff}$ (anchoring) updates. The doubled target is held fixed within each curriculum block to ensure stable optimization.
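As a concrete illustration, the following is a minimal PyTorch sketch of the two anchoring-stage losses. It assumes teacher-forced decoder logits of shape (T, V) are available; the function names and the exact position indexing are our assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def redo_loss(logits: torch.Tensor, doubled_target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy toward the doubled target Y^(m) = [B || B].

    logits: (T, V) teacher-forced decoder logits; doubled_target: (T,) token ids.
    """
    return F.cross_entropy(logits, doubled_target)

def eos_suppression_loss(logits: torch.Tensor, eos_id: int, pos: int) -> torch.Tensor:
    """P(EOS | X+delta, pos) minus the best non-EOS probability at position pos.

    Minimizing this pushes probability mass at position `pos` away from EOS
    (the position choice follows the pseudocode's evaluation at step 2|B|).
    """
    probs = logits[pos].softmax(dim=-1)
    mask = torch.ones_like(probs, dtype=torch.bool)
    mask[eos_id] = False
    return probs[eos_id] - probs[mask].max()
```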

2. Algorithmic Workflow and Pseudocode Structure

REDO is tightly integrated within the MORE adversarial generation loop, as outlined in Algorithm 1 of (Gao et al., 5 Jan 2026). The process can be summarized as follows:

1. Input: audio X, ground truth Y, budget ε, step size α, total steps K, accuracy steps Ka, doubling period D, model f
2. Initialize δ ← 0
3. // Repulsion stage (i = 1..Ka)
   For i = 1 to Ka:
       L_acc ← −CE(f(X+δ), Y)
       δ ← Clip_{‖·‖≤ε}(δ − α·sign(∇_δ L_acc))
4. // Anchoring stage (i = Ka+1..K)
   For i = Ka+1 to K:
       s ← i − Ka
       If (s−1) mod D == 0:
           Ŷ ← GreedyDecode(f, X+δ)
           B ← Ŷ_{1:|Ŷ|−1}        // remove EOS
           Y′ ← [B‖B]             // doubled REDO target
       L_EOS ← P(EOS | X+δ, 2|B|) − max_{v≠EOS} P(v | X+δ, 2|B|)
       L_REDO ← CE(f(X+δ), Y′)
       L_eff ← L_EOS + L_REDO
       δ ← Clip_{‖·‖≤ε}(δ − α·sign(∇_δ L_eff))
5. Return δ

The curriculum “doubled target” is refreshed once every $D$ iterations, creating a repetitive anchor and allowing the model to enter a repetition loop.
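For readers who prefer code, the PyTorch sketch below renders the control flow of this loop under stated assumptions: `teacher_forced_logits` and `greedy_decode` are hypothetical helpers with the documented signatures, `eos_suppression_loss` is the sketch above, and the projection assumes an $L_\infty$ budget. It is a sketch of the recipe, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def more_attack(model, x, y, eps, alpha, K, Ka, D, eos_id):
    """Two-stage PGD following the REDO/MORE recipe (sketch).

    Assumes teacher_forced_logits(model, audio, target) -> (T, V) logits and
    greedy_decode(model, audio) -> 1-D tensor of token ids ending in EOS.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    y_doubled = None

    for i in range(1, K + 1):
        if i <= Ka:
            # Repulsion stage: descend L_acc = -CE, i.e. maximize CE against Y.
            logits = teacher_forced_logits(model, x + delta, y)
            loss = -F.cross_entropy(logits, y)
        else:
            # Anchoring stage: refresh the doubled target every D steps.
            s = i - Ka
            if (s - 1) % D == 0:
                with torch.no_grad():
                    b = greedy_decode(model, x + delta)[:-1]  # drop EOS
                    y_doubled = torch.cat([b, b])             # Y' = [B || B]
            logits = teacher_forced_logits(model, x + delta, y_doubled)
            # Would-be EOS position; exact indexing is an assumption.
            loss = (eos_suppression_loss(logits, eos_id, pos=len(y_doubled) - 1)
                    + F.cross_entropy(logits, y_doubled))
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()  # signed gradient descent on the stage loss
            delta.clamp_(-eps, eps)       # projection onto the epsilon-ball
    return delta.detach()
```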

3. Theoretical Motivation and Properties

REDO exploits the autoregressive repetition bias in transformer decoders, which tend to form self-reinforcing loops when a token span re-emerges in context. By constructing $Y^{(m)} = [B_m \,\Vert\, B_m]$ and training the model toward this repeated sequence, REDO amplifies the likelihood of stable repetition, as confirmed in [Xu et al., 2022]. This mechanism differs fundamentally from random token repetition and leverages context-driven attention dynamics.

From a computational perspective, the sequence length $l_t$ after $t$ doubling blocks grows exponentially ($l_t \approx 2^t l_0$). Decoder self-attention complexity is $O(l^2)$, so the per-example FLOPs can increase as $O(4^t l_0^2)$. Only a few doubling cycles can thus escalate compute requirements dramatically before beam/length constraints are encountered. Appendix B of (Gao et al., 5 Jan 2026) details comprehensive FLOPs analyses demonstrating this effect.
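A quick back-of-the-envelope check of this growth, taking $l_0 = 30$ tokens (roughly the accuracy-only baseline length reported below) as an illustrative starting point:

```python
l0 = 30  # illustrative initial transcript length (tokens)
for t in range(4):
    l = l0 * 2 ** t
    # self-attention cost scales with l^2, i.e. 4^t relative to the start
    print(f"after {t} doublings: length {l}, attention cost x{4 ** t}")
# after 0 doublings: length 30, attention cost x1
# after 3 doublings: length 240, attention cost x64
```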

4. Empirical Results and Ablation Analyses

Experiments were conducted on LibriSpeech and LJ-Speech, with Whisper (all model sizes) as the victim network and perturbation budgets set to SNR = 35 dB ($\epsilon = 0.002$) and SNR = 30 dB ($\epsilon = 0.0035$). Key metrics are word error rate (WER, %) and decoded token length. The main findings include:

  • Baseline Comparison on Whisper-base (LibriSpeech):
    • PGD: WER=88.73%, length=31.65
    • SlothSpeech: WER=54.63%, length=156.07
    • MORE (with REDO): WER=88.42%, length=300.13

Across all model sizes and datasets:

  • MORE achieves comparable or higher WER than state-of-the-art accuracy-focused attacks (≈90%)
  • Output length produced by MORE with REDO is 8–14× longer than accuracy-only baselines
  • Transcript length under MORE exceeds that of the SlothSpeech attack by 2–3× while maintaining WER ≥ 50%

Ablation results (LJ-Speech, SNR=35 dB) highlight:

  • Removing $L_\text{acc}$ reduces WER to 27.6% but still yields ≈293 tokens (REDO sustains length)
  • Removing $L_\mathrm{REDO}$ reduces length from 296 to 120 tokens
  • Removing all efficiency losses yields ≈30 tokens, WER ≈ 94% (accuracy-only baseline)

The combination $L_\text{acc} + L_\text{EOS} + L_\mathrm{REDO}$ achieves the most robust joint attack on accuracy and efficiency.

5. Implementation Considerations and Practical Guidelines

Critical implementation factors include:

  • Doubling period $D$: Governs curriculum frequency; $D = 10$ steps provides stable optimization and rapid transcript elongation.
  • Accuracy-only steps $K_a$: Sufficiently many steps (e.g., $K_a = 50$) are needed to ensure high WER; too many constrain the attack's ability to induce repetition in the subsequent anchoring stage.
  • PGD step size $\alpha$: Should maintain imperceptibility within the $\epsilon$-ball; $\alpha \approx \epsilon/5$ per step is effective in the reported experiments.
  • EOS suppression: Although not strictly necessary for repetition, removing $L_\text{EOS}$ reduces transcript length by ≈20%. EOS suppression is synergistic with REDO.
  • Efficiency on other decoders: REDO can be adapted to any autoregressive decoder with repetition-loop tendencies. For decoders with length penalties or beam search, the curriculum block size $D$ and the suppression routines may need adjustment.
  • Length caps: Practical sequence-length constraints may limit attainable compute amplification. A configuration sketch collecting these settings follows this list.
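A minimal configuration sketch gathering the reported hyperparameters in one place; the field names are ours, and the total step count $K$ is an illustrative assumption since it is not fixed above:

```python
from dataclasses import dataclass

@dataclass
class MoreRedoConfig:
    eps: float = 0.002         # L_inf budget for the 35 dB SNR setting
    alpha: float = 0.002 / 5   # PGD step size, alpha ~ eps / 5
    k_acc: int = 50            # repulsion (accuracy-only) steps K_a
    k_total: int = 250         # total PGD steps K (illustrative assumption)
    doubling_period: int = 10  # curriculum block size D
    use_eos_suppression: bool = True  # dropping it costs ~20% transcript length
```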

6. Distinction from Prior Attacks and Generalization

REDO represents a novel direction in adversarial attack methodology, emphasizing not only accuracy degradation (maximizing WER) but also resource-intensive output manipulation through exponential transcript elongation. This approach stands apart from prior attacks, such as PGD and SlothSpeech, in jointly optimizing accuracy and efficiency objectives.

The application of REDO to other ASR systems or generative transformers requires ensuring the presence of repetition loop dynamics, a suitable initial accuracy-degradation stage, and potential adaptation for decoders employing beam search or length normalization. The role of structured repetition, as opposed to random or unstructured extension, is central to the attack’s stable long-sequence induction.

7. Summary and Significance

The Repetitive Encouragement Doubling Objective (REDO), as operationalized within MORE, enables targeted attacks on transformer-based ASR models that force joint degradation of transcription correctness and massive inflation of inference cost via transcript elongation. By leveraging curriculum-based repetition incentives and periodic anchor construction, REDO induces model behaviors that are costly both in accuracy and computational resources. Empirical results substantiate its superiority over prior baselines for both attack criteria (Gao et al., 5 Jan 2026). This suggests that efficiency-oriented adversarial objectives represent a significant, underexplored axis in robustness analysis for modern ASR systems.
