
SITA-R1 Pipeline for Tonal ASR Adaptation

Updated 27 February 2026
  • The paper introduces the SITA-R1 Pipeline, a two-stage adaptation approach combining cross-gender contrastive and tone-aware losses to enhance XLS-R for low-resource tonal ASR.
  • The methodology first freezes lower transformer blocks for robust representation learning and then fine-tunes using CTC loss with optional knowledge distillation.
  • The pipeline achieves significant improvements in cross-gender retrieval and tone separation while generalizing effectively across Hmong and Mandarin tonal datasets.

SITA-R1 Pipeline refers to the two-stage adaptation framework introduced in "SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages" (Xu et al., 14 Jan 2026). The pipeline is designed to enhance wav2vec-style encoders—specifically XLS-R—by imposing speaker-invariance and tone-awareness for low-resource tonal ASR and representation learning. The approach deploys a curriculum in which a contrastive/tone-aware representation stage is followed by ASR fine-tuning with connectionist temporal classification (CTC) and optional knowledge distillation.

1. Pipeline Architecture and Overview

SITA-R1 adapts a pretrained XLS-R encoder through two sequential stages:

  • Stage I: Representation Learning
    • Freeze the lowest $B$ transformer blocks.
    • Update blocks $B+1,\dots,L$.
    • At layer $L$, extract an $\ell_2$-normalized frame-level representation $z$.
    • Multi-objective loss:
      • Cross-gender contrastive loss for speaker-invariance.
      • Tone-repulsive InfoNCE + tone classification for tone-awareness.
  • Stage II: ASR Fine-Tuning
    • Freeze blocks $1,\dots,L$ from Stage I.
    • Update blocks $L+1,\dots,M$ (top stack) and the CTC head.
    • Optimize a weighted combination of CTC loss and knowledge distillation (if a teacher is available).

The block structure is as follows (for XLS-R; $M=24$ layers):

XLS-R: blocks 1...B (frozen)  →  blocks B+1...L (updated)  →  blocks L+1...24 (frozen/updated in II)
          └─────────────Stage I────────────┘   └─────Stage II─────┘
at stage boundary: freeze 1...L, update L+1...24 + CTC head
This two-stage procedure enforces invariant and disentangled lexical/tone properties prior to ASR fitting (Xu et al., 14 Jan 2026).
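
The freeze/update schedule in the diagram can be sketched as a small helper that lists which blocks receive gradients in each stage. This is only an illustration: the function name `trainable_blocks` and the split `B=8`, `L=16` are hypothetical, since the text above does not state the values of $B$ and $L$.

```python
from typing import List


def trainable_blocks(stage: int, B: int, L: int, M: int = 24) -> List[int]:
    """Return the 1-indexed transformer blocks that are updated in a stage.

    Stage I updates blocks B+1..L (blocks 1..B stay frozen).
    Stage II updates blocks L+1..M (blocks 1..L stay frozen).
    """
    if not (0 <= B < L < M):
        raise ValueError("expected 0 <= B < L < M")
    if stage == 1:
        return list(range(B + 1, L + 1))
    if stage == 2:
        return list(range(L + 1, M + 1))
    raise ValueError("stage must be 1 or 2")


# Hypothetical split, chosen only for illustration (not the paper's values).
stage1 = trainable_blocks(1, B=8, L=16)
stage2 = trainable_blocks(2, B=8, L=16)
print("Stage I updates:", stage1)
print("Stage II updates:", stage2)
```

In a real training script, a list like this would typically drive a loop that toggles gradient tracking on the corresponding encoder layers; the CTC head is trained in Stage II in addition to the listed blocks.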

2. Representation Learning: Cross-Gender Contrastive and Tone Objectives

2.1 Data and Embeddings

  • Input dataset $D = \{(x_i, y_i, t_i, g_i)\}_{i=1}^M$:
    • $x_i$: speech segment waveform
    • $y_i$: orthographic word label
    • $t_i$: tone label ($t_i \in \{1, \dots, T\}$)
    • $g_i$: speaker gender (male/female)
  • Embedding extraction:
    • Forward $x_i$ through XLS-R to layer $L$: $h^{(L)}_i \in \mathbb{R}^d$
    • Normalize: $z_i = h^{(L)}_i / \|h^{(L)}_i\|_2$
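
The normalization step can be sketched in NumPy as follows. The array `h` is a random stand-in for layer-$L$ encoder outputs, and the dimension 1024 is an assumption made for the example; the point is that, after $\ell_2$ normalization, a plain dot product equals the cosine similarity used by the losses below.

```python
import numpy as np


def l2_normalize(h: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of h to unit Euclidean norm (eps guards against zeros)."""
    norms = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norms, eps)


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 1024))  # stand-in for h_i^(L); d=1024 is assumed
z = l2_normalize(h)

# For unit vectors, the dot product is the cosine similarity.
cos_01 = h[0] @ h[1] / (np.linalg.norm(h[0]) * np.linalg.norm(h[1]))
print(np.isclose(z[0] @ z[1], cos_01))
```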

2.2 Cross-Gender Contrastive Loss ($\mathcal{L}_{speaker}$)

  • For anchor $x_i$, positive $x_i^+$ is the same lexical item spoken by an opposite-gender speaker (via voice conversion or direct recording), negatives $x_{i,n}^-$ are different words ($y_j \ne y_i$).
  • Compute cosine similarities (temperature $\tau_g$):

$$s_i^+ = \frac{z_i^\top z_i^+}{\tau_g}, \quad s_{i,n}^- = \frac{z_i^\top z_{i,n}^-}{\tau_g}$$

$$\ell_1(x_i) = -\log \frac{\exp(s_i^+)}{\exp(s_i^+) + \sum_{n=1}^N \exp(s_{i,n}^-)}$$

  • Aggregate:

$$\mathcal{L}_{speaker}(D) = \frac{1}{M} \sum_{i=1}^M \ell_1(x_i)$$
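
As a minimal sketch, the per-anchor term $\ell_1$ can be computed in NumPy with a numerically stable log-sum-exp. The function name `cross_gender_nce` and the toy one-hot embeddings are assumptions made for illustration; inputs are expected to be $\ell_2$-normalized already.

```python
import numpy as np


def cross_gender_nce(z, z_pos, z_neg, tau_g=0.07):
    """Per-anchor contrastive loss ell_1.

    z: (d,) anchor, z_pos: (d,) opposite-gender positive,
    z_neg: (N, d) different-word negatives; all unit-norm.
    """
    s_pos = float(z @ z_pos) / tau_g
    s_neg = (z_neg @ z) / tau_g
    logits = np.concatenate(([s_pos], s_neg))
    m = logits.max()  # subtract the max before exponentiating for stability
    log_denominator = m + np.log(np.exp(logits - m).sum())
    return log_denominator - s_pos


d, N = 32, 20
anchor = np.eye(d)[0]
orthogonal_negatives = np.eye(d)[1:N + 1]

# Positive identical to the anchor, negatives orthogonal: loss is near zero.
loss_easy = cross_gender_nce(anchor, anchor, orthogonal_negatives)

# Negatives indistinguishable from the positive: loss equals log(N + 1).
loss_hard = cross_gender_nce(anchor, anchor, np.tile(anchor, (N, 1)))
print(loss_easy, loss_hard, np.log(N + 1))
```

Averaging this quantity over all anchors gives $\mathcal{L}_{speaker}(D)$.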

2.3 Tone-Repulsive + Tone-Classification Loss ($\mathcal{L}_{tone}$)

  • For anchor $x_i$:
    • Positives $P_i^+$: same word, same tone.
    • Hard negatives $H_i$: same word, different tone.
    • Soft negatives $N_i$: different word.
  • Similarities (temperature $\tau_t$):

$$s_{ij} = \frac{z_i^\top z_j}{\tau_t}, \quad Z_i = \sum_{j \in P_i^+ \cup H_i \cup N_i} \exp(s_{ij})$$

  • Repulsive InfoNCE:

$$\ell_2(x_i) = -\frac{1}{|P_i^+|} \sum_{j \in P_i^+} \log \frac{\exp(s_{ij})}{Z_i}$$

  • Tone classification head $p_\varphi(t | z) = \mathrm{softmax}(W_{cls} z + b_{cls})$:

$$\ell_3(x_i, t_i) = -\log p_\varphi(t_i|z_i)$$

  • Combine:

$$\mathcal{L}_{tone}(D) = \frac{1}{M} \sum_{i=1}^M [\ell_2(x_i) + \lambda_{cls} \ell_3(x_i, t_i)]$$
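
The per-anchor tone term $\ell_2 + \lambda_{cls}\,\ell_3$ can be sketched as below. The function name `tone_loss`, the index-list interface, and the toy embeddings are assumptions for illustration; tone labels are 0-indexed in the code, whereas the text uses $1,\dots,T$.

```python
import numpy as np


def tone_loss(z, i, pos, hard, soft, W, b, t_i, tau_t=0.07, lam_cls=1.0):
    """Return ell_2 + lam_cls * ell_3 for anchor i.

    z: (n, d) unit-norm embeddings; pos/hard/soft: index lists for
    P_i^+, H_i and N_i; W: (T, d), b: (T,) tone classifier; t_i: 0-indexed tone.
    """
    candidates = np.array(list(pos) + list(hard) + list(soft))
    s = (z[candidates] @ z[i]) / tau_t
    m = s.max()
    log_Z = m + np.log(np.exp(s - m).sum())  # stable log of Z_i
    ell2 = -np.mean(s[: len(pos)] - log_Z)   # average over positives only

    logits = W @ z[i] + b
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    ell3 = -log_probs[t_i]
    return ell2 + lam_cls * ell3


# Toy case with an uninformative classifier (zero weights), T = 7 tones.
W, b = np.zeros((7, 3)), np.zeros(7)
far = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
near = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]], dtype=float)

# Row 0 is the anchor, row 1 the positive, row 2 the hard negative, row 3 soft.
loss_far = tone_loss(far, 0, [1], [2], [3], W, b, t_i=2)
loss_near = tone_loss(near, 0, [1], [2], [3], W, b, t_i=2)
print(loss_far, loss_near)
```

When the hard negative (same word, different tone) coincides with the anchor, the InfoNCE term rises by about $\log 2$ in this toy case, which is the pressure that pushes different-tone embeddings apart.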

2.4 Stage I Objective

$$\mathcal{L}_{Stage1} = \alpha \cdot \mathcal{L}_{speaker}(D) + (1-\alpha) \cdot \mathcal{L}_{tone}(D)$$

Default hyperparameters: $\alpha = 0.5$, $\tau_g = \tau_t = 0.07$, $\lambda_{cls} = 1.0$, $N = 20$ negatives.
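
A small sketch of the Stage I weighting, using the default $\alpha$ and temperature. The loss values passed in are placeholders, not results from the paper. The snippet also shows the logit range implied by $\tau = 0.07$: cosine similarity is bounded by 1, so scaled similarities lie in roughly $[-14.3, 14.3]$.

```python
def stage1_objective(l_speaker: float, l_tone: float, alpha: float = 0.5) -> float:
    """Convex combination of the speaker and tone losses."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_speaker + (1.0 - alpha) * l_tone


TAU = 0.07
max_logit = 1.0 / TAU  # largest possible similarity / temperature

# Placeholder loss values, for illustration only.
example = stage1_objective(l_speaker=2.0, l_tone=4.0)
print(example, round(max_logit, 2))
```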

3. ASR Fine-Tuning: CTC and Knowledge Distillation

3.1 CTC Loss

With blocks $1,\dots,L$ frozen, adapt blocks $L+1,\dots,24$ and the CTC head $W_{ctc}$. For sequence $x$ and transcription $y$:

  • CTC likelihood, summed over all frame-level alignments $\pi$ that collapse to $y$ (here $B(\cdot)$ is the CTC collapsing map that merges repeated labels and removes blanks, not the frozen-block count from Section 1, and $T$ is the number of frames):

$$p_\theta(y|x) = \sum_{\pi: B(\pi)=y} \prod_{t=1}^T p_\theta(\pi_t|x)$$

  • Empirical CTC loss:

$$\mathcal{L}_{CTC}(D_{asr}) = -\frac{1}{|D_{asr}|} \sum_{(x_i,y_i)} \log p_\theta(y_i|x_i)$$
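
The sum over alignments is normally computed with the standard CTC forward recursion rather than by enumeration. The sketch below implements that recursion in plain probability space (adequate for tiny inputs; practical implementations work in log space) and checks it against brute-force enumeration of all paths. The function names and the random posteriors are assumptions for illustration.

```python
from itertools import product

import numpy as np


def ctc_likelihood(probs, target, blank=0):
    """p(y|x) via the CTC forward recursion. probs: (T, V) frame posteriors."""
    T = probs.shape[0]
    ext = [blank]
    for c in target:
        ext += [c, blank]  # blank-interleaved target of length 2|y| + 1
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return total


def collapse(path, blank=0):
    """The collapsing map: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out


def ctc_brute_force(probs, target, blank=0):
    """Sum the path probabilities of every alignment that collapses to target."""
    T, V = probs.shape
    return sum(
        np.prod([probs[t, p] for t, p in enumerate(path)])
        for path in product(range(V), repeat=T)
        if collapse(path, blank) == list(target)
    )


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))  # T=4 frames, V=3 symbols (0 is blank)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p_fast = ctc_likelihood(probs, [1, 2])
p_slow = ctc_brute_force(probs, [1, 2])
print(p_fast, p_slow, -np.log(p_fast))
```

The last printed value is the per-utterance contribution $-\log p_\theta(y|x)$ that is averaged in $\mathcal{L}_{CTC}$.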

3.2 Knowledge Distillation

With a frozen teacher ($\varphi$) and temperature $\tau_{kd}$, the teacher and student frame posteriors are softened as

$$\tilde p_\varphi(\pi_t|x) = \mathrm{softmax}(o_t'/\tau_{kd}), \quad \tilde p_\theta(\pi_t|x) = \mathrm{softmax}(o_t/\tau_{kd})$$

where $o_t'$ and $o_t$ are the teacher and student logits at frame $t$. Per-frame KL loss:

$$\mathcal{L}_{KD}(D_{asr}) = \frac{1}{|D_{asr}|} \sum_{(x_i)} \sum_{t=1}^T \mathrm{KL}[\tilde p_\varphi(\cdot|x_i) \,\|\, \tilde p_\theta(\cdot|x_i)]$$

3.3 Stage II Objective

$$\mathcal{L}_{Stage2} = \delta \cdot \mathcal{L}_{CTC} + (1-\delta)\cdot \mathcal{L}_{KD}$$

Default: $\delta=0.7$, $\tau_{kd}=3.0$ if KD is used; $\delta=1.0$ if not.
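
The distillation term and the Stage II weighting can be sketched as follows. The function names and the random logits are assumptions for illustration. The KL term is written exactly as in Section 3.2, without the $\tau_{kd}^2$ rescaling that some distillation recipes add.

```python
import numpy as np


def log_softmax(o, tau):
    """Temperature-scaled log-softmax over the last axis."""
    x = o / tau
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def kd_loss(teacher_logits, student_logits, tau_kd=3.0):
    """Sum over frames of KL(teacher || student); logits have shape (T, V)."""
    log_p_teacher = log_softmax(teacher_logits, tau_kd)
    log_p_student = log_softmax(student_logits, tau_kd)
    return float((np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum())


def stage2_objective(l_ctc, l_kd=None, delta=0.7):
    """Weighted CTC + KD loss; falls back to pure CTC (delta = 1) without a teacher."""
    if l_kd is None:
        return l_ctc
    return delta * l_ctc + (1.0 - delta) * l_kd


rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 10))  # stand-in teacher logits o_t'
student = rng.normal(size=(5, 10))  # stand-in student logits o_t
print(kd_loss(teacher, student), kd_loss(teacher, teacher))
print(stage2_objective(2.0, 1.0), stage2_objective(2.0))
```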

4. Training Data, Preprocessing, and Optimization

4.1 Corpora

  • Hmong Word-Level Corpus (WRT)
    • 8,570 tokens (3F/5M), 1,143 unique words, 163 base-words $\times$ 7 tones.
    • Voice-converted augmentation adds 3,600 cross-gender tokens.
    • Word-level VAD segmentation; “unseen speaker” splits hold out one male as query.
  • Mandarin ("Tone Perfect")
    • 9,840 monosyllabic tokens, 6 speakers, 410 syllables $\times$ 4 tones.
    • One male held out for test.

4.2 Preprocessing and Augmentation

  • Resample to 16 kHz, peak normalization.
  • On-the-fly perturbation: additive noise, time-stretch, gain jitter.
  • No pitch shift for tone-sensitive tasks.
  • FreeVC voice conversion applied offline for cross-gender variants.
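
A minimal NumPy sketch of three of the listed steps: peak normalization, additive noise, and gain jitter. The SNR of 20 dB, the ±3 dB gain range, and the synthetic sine input are assumptions, since the text does not give these settings. Resampling, time-stretch, and voice conversion are left out because they need dedicated audio tooling, and no pitch shift is applied, in line with the list above.

```python
import numpy as np


def peak_normalize(x, peak=1.0, eps=1e-9):
    """Scale the waveform so its largest absolute sample equals `peak`."""
    return x * (peak / max(np.abs(x).max(), eps))


def add_noise(x, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(scale=np.sqrt(noise_power), size=x.shape)


def gain_jitter(x, max_db, rng):
    """Apply a random gain drawn uniformly from [-max_db, +max_db] dB."""
    gain_db = rng.uniform(-max_db, max_db)
    return x * 10 ** (gain_db / 20)


rng = np.random.default_rng(0)
sr = 16_000                                    # target sample rate from the text
t = np.arange(sr) / sr
wave = 0.3 * np.sin(2 * np.pi * 220.0 * t)     # synthetic stand-in for speech

clean = peak_normalize(wave)
noisy = add_noise(clean, snr_db=20.0, rng=rng)       # assumed SNR
augmented = gain_jitter(noisy, max_db=3.0, rng=rng)  # assumed jitter range
```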

4.3 Optimization

  • Adam optimizer, learning rate $5\times10^{-4}$, weight decay $10^{-2}$, gradient clipping at 1.0.
  • Effective batch size $16$ (4 segments $\times$ 4-step accumulation).
  • 1,200-step warmup, 12,000 updates per stage.
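
The schedule arithmetic can be sketched as below. The text gives the warmup length and peak learning rate but not the shape of the schedule, so the linear warmup and the constant rate afterwards are assumptions, as are the names used here.

```python
def learning_rate(step: int, peak_lr: float = 5e-4, warmup_steps: int = 1200) -> float:
    """Assumed schedule: linear warmup to peak_lr, then constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


MICRO_BATCH = 4        # segments per forward/backward pass
ACCUM_STEPS = 4        # gradient-accumulation steps per optimizer update
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS
TOTAL_UPDATES = 12_000  # optimizer updates per stage

# Segments processed per stage under these settings.
segments_seen = EFFECTIVE_BATCH * TOTAL_UPDATES
print(EFFECTIVE_BATCH, segments_seen, learning_rate(0), learning_rate(5000))
```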

5. Evaluation Protocols and Results

| Metric (Hmong) | SITA-R1 | XLS-R ASR Teacher |
| --- | --- | --- |
| Top-1 Retrieval (F→M / M→F) | 0.629 / 0.593 | 0.329 / 0.461 |
| Top-5 Retrieval (F→M / M→F) | 0.929 / 0.889 | 0.671 / 0.779 |
| Unseen M→F Top-1 / Top-5 | 0.687 / 0.908 | 0.545 / 0.873 |
| Tone Geometry: PosSim | 0.80 | ≈0.99 |
| Tone Geometry: Hard NegDist | 0.675 | ≈0.01 |
| Tone Geometry: Soft NegDist | 0.940 | |
| CER / WER | 0.1985 / 0.5115 | 0.1835 / 0.4610 |

  • Mandarin transfer performance: Top-1 retrieval (M→F): SITA 0.993 vs. ASR-XLS-R 0.962; CER/WER with KD: 0.0073/0.0280.
  • On Hmong, SITA-R1 substantially improves cross-gender retrieval (e.g., F→M Top-1 from 0.329 to 0.629) and separates same-word, different-tone embeddings (Hard NegDist 0.675 vs. ≈0.01), with a modest ASR degradation relative to the teacher (CER 0.1985 vs. 0.1835; WER 0.5115 vs. 0.4610). The approach transfers to Mandarin with no hyperparameter tuning (Xu et al., 14 Jan 2026).
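
Top-k cross-gender retrieval can be sketched as a nearest-neighbour search over unit-norm embeddings. The protocol assumed here (queries from one gender, gallery from the other, a hit when any of the k nearest gallery items shares the query's word label) is an interpretation of the F→M / M→F notation, and the toy vectors are illustrative only.

```python
import numpy as np


def top_k_accuracy(queries, gallery, query_labels, gallery_labels, k=1):
    """Fraction of queries whose k nearest gallery items include the same label.

    queries: (Q, d) and gallery: (G, d) unit-norm embeddings, so the dot
    product is the cosine similarity.
    """
    sims = queries @ gallery.T
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar items
    hits = (gallery_labels[top] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: three gallery words, two queries that both carry word label 0.
gallery = np.eye(3)
gallery_labels = np.array([0, 1, 2])
queries = np.array([[1.0, 0.0, 0.0],   # closest to word 0: a Top-1 hit
                    [0.6, 0.8, 0.0]])  # closest to word 1, word 0 is second
query_labels = np.array([0, 0])

top1 = top_k_accuracy(queries, gallery, query_labels, gallery_labels, k=1)
top2 = top_k_accuracy(queries, gallery, query_labels, gallery_labels, k=2)
print(top1, top2)
```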

6. Practical Significance and Generality

SITA-R1 demonstrates that sequential multi-objective adaptation of multilingual encoders can generate representations robust to speaker and sensitive to tone, crucial for tonal language ASR under severe data constraints. The staged design, which first achieves the desired geometry and then fits ASR decoding, prevents the collapse of tone-sensitive representations typical in naïve fine-tuning. The method's generalization across Hmong and Mandarin, with identical settings, supports applicability to other low-resource tonal languages.

7. Relation to SITA and Other "R1" Pipelines

No "SITA-R1" variant is defined in the context of adversarial attacks for stylized image generation (Kang et al., 25 Mar 2025). In the SITA speech-adaptation context, SITA-R1 exclusively refers to the two-stage speaker-invariant, tone-aware pipeline (Xu et al., 14 Jan 2026). Other ablation variants discussed for SITA in image-space (e.g., alternative destylization losses) are not denoted “R1.” Thus, in this research area, SITA-R1 is unambiguously associated with low-resource tonal speech adaptation, not adversarial image protection or vision-language grounding.
