
Cross-Lingual RTT Attacks in NLP

Updated 15 January 2026
  • The paper introduces RTT attacks, where text is transformed through neural machine translation to obfuscate token-level, syntactic, and semantic features.
  • Methodologies like DRTT and CLSA manipulate translations and summarizations to systematically degrade watermark detection and adversarial robustness.
  • Empirical results show marked drops in watermark detection accuracy and adversarial effectiveness, highlighting the need for robust countermeasures.

Cross-lingual round-trip translation (RTT) attacks are adversarial or obfuscation techniques that leverage neural machine translation (NMT) systems to undermine the robustness of NLP models, watermarking detectors, or data provenance schemes. By passing text through one or more translation systems (and sometimes other semantic bottlenecks), RTT attacks systematically disrupt token-level, syntactic, and even semantic properties, exposing vulnerabilities in models or watermarking methods that rely on language-specific or local statistical patterns.

1. Core Principles and Taxonomy of RTT Attacks

RTT attacks consist of transforming an input text $x$ by translating it into a pivot or target language $\ell_p$, and optionally back to the source language $\ell_s$, thus $x' = M_{\ell_p \to \ell_s}(M_{\ell_s \to \ell_p}(x))$. The process introduces variation through vocabulary mapping, syntactic reordering, and loss of stable token indices, thereby obfuscating patterns seeded by watermarking algorithms or breaking the perturbative structure of adversarial manipulations. Cross-lingual Summarization Attacks (CLSA) and Doubly Round-Trip Translation (DRTT) attacks extend this principle, introducing summarization steps or dual RTT loops to further degrade statistical artifacts or to control adversarial example authenticity (Lai et al., 2022, Ganesan, 27 Oct 2025, Tariqul et al., 8 Jan 2026).

Key variants include:

  • Single-RTT attacks: Simple source → pivot → source loops.
  • Doubly Round-Trip Translation (DRTT): Applies both source → target → source and target → source → target loops to ensure adversarial error can be uniquely attributed (Lai et al., 2022).
  • CLSA (with or without back-translation): Translation, summarization, and optional back-translation to maximize semantic-bottleneck effects (Ganesan, 27 Oct 2025).
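The loop variants above can be sketched in a few lines. The `translate(text, src, tgt)` function below is a hypothetical stand-in for a real NMT call; only the composition of loops reflects the taxonomy:

```python
# Sketch of single-RTT and DRTT loops. translate() is a hypothetical
# placeholder for an NMT system; a real implementation would call a
# translation model or API here.

def translate(text, src, tgt):
    """Placeholder NMT call, tagging the text with its direction."""
    return f"[{src}->{tgt}] {text}"

def single_rtt(x, src="en", pivot="de"):
    """Single-RTT attack: source -> pivot -> source loop."""
    return translate(translate(x, src, pivot), pivot, src)

def drtt(x, y, src="en", tgt="zh"):
    """Doubly round-trip translation: loop both directions so that
    adversarial error can be attributed to a specific model
    (Lai et al., 2022)."""
    x_rtt = translate(translate(x, src, tgt), tgt, src)  # src->tgt->src
    y_rtt = translate(translate(y, tgt, src), src, tgt)  # tgt->src->tgt
    return x_rtt, y_rtt
```

With a real NMT backend substituted for `translate`, `single_rtt` reproduces the basic obfuscation loop and `drtt` the dual loop of Section 2.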

2. RTT in Adversarial Example Generation for NMT

In adversarial NMT, RTT attacks are used to craft examples where meaning preservation is not a hard constraint. In the single-RTT adversarial setup, a perturbed source $x'$ is deemed adversarial if the similarity drop $d_{\mathrm{src}}(x, x') = \frac{\mathrm{sim}(x, \hat{x}) - \mathrm{sim}(x', \hat{x}')}{\mathrm{sim}(x, \hat{x})}$ exceeds a threshold $\beta_0$, with $\hat{x} = g(f(x))$ and $\hat{x}' = g(f(x'))$ for NMT models $f$ (forward) and $g$ (backward) (Lai et al., 2022). However, this criterion is vulnerable: it cannot distinguish between errors arising in $f$ and $g$.

DRTT attacks address this by:

  • Defining an additional target-side similarity drop $d_{\mathrm{tgt}}(y, y') = \frac{\mathrm{sim}(y, \hat{y}) - \mathrm{sim}(y', \hat{y}')}{\mathrm{sim}(y, \hat{y})}$,
  • Accepting a pair as adversarial only if $d_{\mathrm{src}}(x, x') > \beta$ and $d_{\mathrm{tgt}}(y, y') < \gamma$.

This guarantees the adversarial effect is genuinely attributable to the target model. Incorporation of masked LLMs to propose phrase-level substitutions and tight source-target alignment allow construction of robust bilingual adversarial pairs, used for both analysis and adversarial training (Lai et al., 2022).
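The DRTT acceptance criterion can be sketched directly from the definitions above. Here `sim` values are assumed to come from any sentence-similarity measure (e.g., BLEU or embedding cosine), and the thresholds `beta` and `gamma` are hypothetical:

```python
# Sketch of the DRTT acceptance criterion (Lai et al., 2022).
# Similarity scores are assumed precomputed by some sentence-similarity
# measure; beta and gamma are illustrative thresholds, not paper values.

def rel_drop(s_clean, s_pert):
    """Relative similarity drop: (sim_clean - sim_pert) / sim_clean."""
    return (s_clean - s_pert) / s_clean

def is_drtt_adversarial(sim_x, sim_x_pert, sim_y, sim_y_pert,
                        beta=0.3, gamma=0.1):
    """Accept (x', y') only if the source-side drop exceeds beta while
    the target-side drop stays below gamma, so the degradation is
    attributable to the forward model f rather than the backward model g."""
    d_src = rel_drop(sim_x, sim_x_pert)   # d_src(x, x')
    d_tgt = rel_drop(sim_y, sim_y_pert)   # d_tgt(y, y')
    return d_src > beta and d_tgt < gamma
```

The two-sided test is the point: a large source-side drop alone (the single-RTT criterion) would also fire on back-translation errors.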

3. RTT Attacks against Watermarking and Provenance Systems

Cross-lingual RTT attacks directly target watermarking approaches by leveraging the translation pipeline to disrupt token-level statistical cues, such as green-list bias, $n$-gram statistics, or semantic-invariant fingerprints (Ganesan, 27 Oct 2025, Tariqul et al., 8 Jan 2026). The attack pipeline is formalized as follows:

$$T_{s\to p}(x) \xrightarrow{S_p} S_p(T_{s\to p}(x)) \xrightarrow{T_{p\to s}} T_{p\to s}(S_p(T_{s\to p}(x)))$$

Each step operates as follows:

  • Pivot Translation ($T_{s\to p}$): Alters subword units, disrupts vocabulary alignment.
  • Abstractive Summarization ($S_p$): Compresses content, deletes watermarked token positions, merges paraphrases, changes frequency dynamics.
  • Back-Translation ($T_{p\to s}$, optional): Reinserts stochastic variation, further decorrelating from the original.
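The three steps compose into the CLSA pipeline shown in the formula above. In this minimal sketch, `nmt_translate` and `summarize` are hypothetical stand-ins for real NMT and abstractive-summarization models:

```python
# Minimal sketch of the CLSA pipeline T_{p->s} ∘ S_p ∘ T_{s->p}.
# nmt_translate() and summarize() are hypothetical placeholders for
# real model calls; the composition order is what matters.

def nmt_translate(text, src, tgt):
    return f"T[{src}->{tgt}]({text})"    # placeholder NMT call

def summarize(text):
    return f"S({text})"                  # placeholder abstractive summarizer

def clsa_attack(x, src="en", pivot="de", back_translate=True):
    """Pivot translation, summarization in the pivot language, and
    optional back-translation to the source language."""
    z = nmt_translate(x, src, pivot)     # T_{s->p}: break subword alignment
    z = summarize(z)                     # S_p: compress, drop watermarked tokens
    if back_translate:
        z = nmt_translate(z, pivot, src) # T_{p->s}: reinsert stochastic variation
    return z
```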

Measurement using AUROC for several watermarking schemes (KGW, XSIR, Unigram) demonstrates almost complete collapse of detection accuracy; e.g., AUROC for XSIR drops from $0.827$ (paraphrase) or $0.823$ (cross-lingual remapping) to $0.53$ (CLSA, near chance) (Ganesan, 27 Oct 2025). In low-resource languages like Bangla, single-layer watermark detection accuracy collapses from 88–91% (benign) to 9–13% post-RTT (Tariqul et al., 8 Jan 2026).

4. Measurement Protocols and Robustness Metrics

Studies quantify the effects of RTT attacks using several complementary metrics:

  • Adversarial Attack Success under RTT: $S_{\mathrm{rtt}}(k)$, the success rate after RTT through $k$ languages. The round-trip robustness ratio is $R(k) = S_{\mathrm{rtt}}(k) / S_{\mathrm{orig}}$ (Bhandari et al., 2023).
  • Watermark Detection Metrics: AUROC, equal error rate (EER), TPR@1%FPR (Ganesan, 27 Oct 2025).
  • Token-Level Statistics: Detection accuracy, the $z$-statistic for green-list token counts, perplexity and ROUGE degradation (Tariqul et al., 8 Jan 2026, Ganesan, 27 Oct 2025).
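Two of these metrics are simple enough to compute directly. The sketch below uses a rank-based AUROC (equivalent to the ROC area) over detector scores and the robustness ratio $R(k)$; all scores are synthetic illustrations, not values from the cited papers:

```python
# Toy computation of two robustness metrics: a rank-based AUROC for
# watermark-detector scores, and the round-trip robustness ratio R(k).
# All numbers below are synthetic, not drawn from the cited papers.

def auroc(pos_scores, neg_scores):
    """Probability that a watermarked text outscores an unwatermarked
    one (ties count half) -- equivalent to the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def robustness_ratio(s_rtt, s_orig):
    """R(k) = S_rtt(k) / S_orig (Bhandari et al., 2023)."""
    return s_rtt / s_orig

# A detector that separates cleanly before the attack ...
print(auroc([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]))   # 1.0
# ... but collapses toward chance after a CLSA-style attack.
print(auroc([0.5, 0.2, 0.6], [0.4, 0.7, 0.3]))   # ~0.44, near chance
print(robustness_ratio(0.348, 1.0))              # 0.348
```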

Table: Example RTT Impact on Watermarking (from Tariqul et al., 8 Jan 2026)

| Method | Detection Acc. (benign) | Detection Acc. (after RTT) |
|--------|-------------------------|----------------------------|
| KGW    | 0.885                   | 0.09                       |
| EXP    | 0.912                   | 0.13                       |

5. Robust Countermeasures and RTT-Adapted Methodologies

Research shows that standard adversarial and watermarking techniques fail under RTT. Emerging countermeasures include:

  • Layered Watermarking: Embedding token-level watermarks (e.g., KGW, EXP) at generation time, then applying post-generation distributional embedding (e.g., Waterfall), yielding 3–4× higher post-RTT detection rates (rising to 40–50% detection) while managing semantic drift (Tariqul et al., 8 Jan 2026). Semantic similarity drops from $\approx 0.88$ to $\approx 0.80$ can be tightly controlled.
  • RTT-Robust Adversarial Generation: NMT-Text-Attack constrains perturbations to retain adversarial status after multi-lingual RTT, as enforced by $\forall \ell \in \{\ell_1, \ldots, \ell_m\},\ V(\mathrm{RTT}_\ell(x_{\mathrm{adv}})) \ne y$ (Bhandari et al., 2023).
  • Doubly Round-Trip Training: Directly training models on authentic bilingual adversarial pairs crafted with DRTT to enhance resilience against both monolingual and cross-lingual perturbations (Lai et al., 2022).
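The RTT-robustness constraint in the second bullet can be sketched as a filter over candidate adversarial examples. The victim classifier `victim` and the round-trip function `rtt` below are hypothetical stand-ins:

```python
# Sketch of the NMT-Text-Attack constraint: keep an adversarial example
# only if the victim classifier V still misclassifies it after a round
# trip through every language in a check set. victim() and rtt() are
# hypothetical stand-ins, not the paper's implementation.

def rtt_robust(x_adv, true_label, victim, rtt, langs=("de", "fr", "zh")):
    """True iff V(RTT_l(x_adv)) != y for every pivot language l."""
    return all(victim(rtt(x_adv, lang)) != true_label for lang in langs)

# Toy usage: a "victim" that labels text 1 if it contains "good", and an
# "rtt" that merely lowercases (standing in for a real translation loop).
victim = lambda text: int("good" in text)
rtt = lambda text, lang: text.lower()
print(rtt_robust("This is GREAT", 1, victim, rtt))   # True: stays misclassified
print(rtt_robust("This is GOOD", 1, victim, rtt))    # False: RTT restores the label
```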

Empirical analysis demonstrates:

  • Attack Effectiveness: Classic text adversarial attacks lose 60–70% of their effectiveness post-RTT (e.g., only 34.8% of adversarial examples succeed after a single RTT across three languages) (Bhandari et al., 2023).
  • Watermark Removal: CLSA attacks drive watermark AUROC to chance on sophisticated schemes (XSIR: $0.53$; KGW on Spanish: $0.511$) (Ganesan, 27 Oct 2025).
  • Defensive Gains: Layered watermarking improves post-RTT detection from 9–13% to 40–50% (Tariqul et al., 8 Jan 2026). DRTT-based adversarial pairs improve BLEU scores under noise over single-RTT by +0.4 to +1.1, without harming clean performance (Lai et al., 2022).
  • Semantic Fidelity: Robust RTT-adapted attacks induce only marginal additional semantic drift ($\Delta < 0.05$ on USE/BERTScore) compared to unconstrained baseline methods (Bhandari et al., 2023).

6. Future Directions and Theoretical Implications

The collapse of token-level watermarking and monolingual adversarial effectiveness under RTT attacks exposes the need for fundamentally new approaches:

  • Hybrid and Semantic-Invariant Watermarking: Incorporation of distributional and cryptographic signals or model attestation, and invariance to paraphrasing and cross-lingual transformation (Ganesan, 27 Oct 2025).
  • Benchmarks for Multilingual Robustness: Standardized evaluation protocols embedding RTT pathways and semantic similarity constraints, especially for low-resource languages (Tariqul et al., 8 Jan 2026).
  • Algorithmic Advances: DRTT, NMT-Text-Attack, and layered schemes highlight the importance of jointly optimizing for cross-lingual and semantic bottleneck resilience.

A plausible implication is that both adversarial robustness research and watermark design will increasingly require joint modeling of linguistic, distributional, and cryptographic constraints to maintain effectiveness under black-box, global, and semantically compressive transformation regimes (Lai et al., 2022, Ganesan, 27 Oct 2025, Tariqul et al., 8 Jan 2026, Bhandari et al., 2023).
