SITA-R1 Pipeline for Tonal ASR Adaptation

Updated 27 February 2026

The paper introduces the SITA-R1 Pipeline, a two-stage adaptation approach combining cross-gender contrastive and tone-aware losses to enhance XLS-R for low-resource tonal ASR.
The methodology first freezes lower transformer blocks for robust representation learning and then fine-tunes using CTC loss with optional knowledge distillation.
The pipeline achieves significant improvements in cross-gender retrieval and tone separation while generalizing effectively across Hmong and Mandarin tonal datasets.

SITA-R1 Pipeline refers to the two-stage adaptation framework introduced in "SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages" (Xu et al., 14 Jan 2026). The pipeline is designed to enhance wav2vec-style encoders—specifically XLS-R—by imposing speaker-invariance and tone-awareness for low-resource tonal ASR and representation learning. The approach deploys a curriculum in which a contrastive/tone-aware representation stage is followed by ASR fine-tuning with connectionist temporal classification (CTC) and optional knowledge distillation.

1. Pipeline Architecture and Overview

SITA-R1 adapts a pretrained XLS-R encoder through two sequential stages:

Stage I: Representation Learning
- Freeze the lowest $B$ transformer blocks.
- Update blocks $B\!+\!1,\dots,L$.
- At layer $L$, extract an $\ell_2$-normalized frame-level representation $z$.
- Multi-objective loss:
  - Cross-gender contrastive loss for speaker-invariance.
  - Tone-repulsive InfoNCE + tone classification for tone-awareness.

Stage II: ASR Fine-Tuning
- Freeze blocks $1,\dots,L$ from Stage I.
- Update blocks $L\!+\!1,\dots,M$ (top stack) and the CTC head.
- Optimize a weighted combination of CTC loss and knowledge distillation (if a teacher is available).

The block structure is as follows (for XLS-R; $M=24$ layers):

XLS-R: blocks 1...B (frozen)  →  blocks B+1...L (updated)  →  blocks L+1...24 (frozen/updated in II)
          └─────────────Stage I────────────┘   └─────Stage II─────┘
at stage boundary: freeze 1...L, update L+1...24 + CTC head

This two-stage procedure enforces invariant and disentangled lexical/tone properties prior to ASR fitting (Xu et al., 14 Jan 2026).
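The freezing schedule above can be written down as a small lookup. This is a minimal sketch, not the authors' code: the function name `block_plan`, the dictionary keys, and the split points `B=6`, `L=12` are illustrative assumptions (the text does not state the values of $B$ and $L$). It also assumes that blocks above $L$ are simply left untouched during Stage I, since the embedding is read out at layer $L$.

```python
def block_plan(stage, B, L, M=24):
    """List the 1-indexed transformer blocks that are frozen or updated in a stage."""
    if not 0 <= B < L <= M:
        raise ValueError("expected 0 <= B < L <= M")
    if stage == 1:
        # Stage I: blocks 1..B frozen, blocks B+1..L updated.
        # Assumption: blocks above L are not updated in this stage.
        return {
            "frozen": list(range(1, B + 1)),
            "updated": list(range(B + 1, L + 1)),
            "untouched": list(range(L + 1, M + 1)),
            "ctc_head_trained": False,
        }
    if stage == 2:
        # Stage II: blocks 1..L frozen, blocks L+1..M and the CTC head updated.
        return {
            "frozen": list(range(1, L + 1)),
            "updated": list(range(L + 1, M + 1)),
            "untouched": [],
            "ctc_head_trained": True,
        }
    raise ValueError("stage must be 1 or 2")


# Hypothetical split points, chosen only for illustration.
plan_1 = block_plan(stage=1, B=6, L=12)
plan_2 = block_plan(stage=2, B=6, L=12)
print(plan_1["updated"])  # [7, 8, 9, 10, 11, 12]
print(plan_2["updated"])
```

In a real training script the same plan would drive which parameters have gradients enabled; that wiring depends on the framework and is left out here.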

2. Representation Learning: Cross-Gender Contrastive and Tone Objectives

2.1 Data and Embeddings

Input dataset $D = \{(x_i, y_i, t_i, g_i)\}_{i=1}^M$ (here $M$ is the number of training segments, not the encoder depth from Section 1):
- $x_i$ : speech segment waveform
- $y_i$ : orthographic word label
- $t_i$ : tone label ( $t_i \in \{1, ..., T\}$ )
- $g_i$ : speaker gender (male/female)
Embedding extraction:
- Forward $x_i$ through XLS-R to layer $L$ : $h^{(L)}_i \in \mathbb{R}^d$
- Normalize: $z_i = h^{(L)}_i / \|h^{(L)}_i\|_2$
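The normalization step matters because it makes the dot products used in the later losses equal to cosine similarities. A minimal sketch in plain Python follows; the helper names `l2_normalize` and `dot` are assumptions for illustration, and the two-dimensional vectors stand in for real layer-$L$ features.

```python
import math


def l2_normalize(h):
    """Return h / ||h||_2 for a vector given as a list of floats."""
    norm = math.sqrt(sum(v * v for v in h))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in h]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for h_i^(L); real features have dimension d.
z_a = l2_normalize([3.0, 4.0])
z_b = l2_normalize([4.0, 3.0])
cosine = dot(z_a, z_b)
```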

2.2 Cross-Gender Contrastive Loss ( $\mathcal{L}_{speaker}$ )

For anchor $x_i$, the positive $x_i^+$ is the same lexical item spoken by an opposite-gender speaker (obtained via voice conversion or direct recording); the negatives $x_{i,n}^-$, $n = 1,\dots,N$, are different words ($y_j \ne y_i$).
Compute temperature-scaled cosine similarities (temperature $\tau_g$; the embeddings are $\ell_2$-normalized, so the dot product is the cosine):

$s_i^+ = \frac{z_i^\top z_i^+}{\tau_g}, \quad s_{i,n}^- = \frac{z_i^\top z_{i,n}^-}{\tau_g}$

InfoNCE loss (per sample):

$\ell_1(x_i) = -\log \frac{\exp(s_i^+)}{\exp(s_i^+) + \sum_{n=1}^N \exp(s_{i,n}^-)}$

Aggregate:

$\mathcal{L}_{speaker}(D) = \frac{1}{M} \sum_{i=1}^M \ell_1(x_i)$
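The per-sample term $\ell_1$ and its average can be sketched directly from the formulas above. This is an illustrative plain-Python version, assuming the embeddings are already $\ell_2$-normalized; the function names `info_nce` and `speaker_loss` and the batch layout (a list of anchor, positive, negatives triples) are assumptions, not part of the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(z_anchor, z_pos, z_negs, tau_g=0.07):
    """l_1(x_i) = -log( exp(s+) / (exp(s+) + sum_n exp(s_n-)) )."""
    s_pos = dot(z_anchor, z_pos) / tau_g
    sims = [s_pos] + [dot(z_anchor, z_n) / tau_g for z_n in z_negs]
    # Log-sum-exp with max subtraction keeps exp() from overflowing at small tau.
    m = max(sims)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - s_pos


def speaker_loss(batch, tau_g=0.07):
    """Average l_1 over (anchor, positive, negatives) triples."""
    return sum(info_nce(a, p, negs, tau_g) for a, p, negs in batch) / len(batch)


# Toy unit vectors: the positive matches the anchor, the negative is orthogonal.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Swapped roles: the "positive" is orthogonal and the negative matches the anchor.
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

With the default $\tau_g = 0.07$, the easy case gives a loss near zero and the swapped case a loss near $1/\tau_g$, which shows how sharply the temperature penalizes a negative that outranks the positive.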

2.3 Tone-Repulsive + Tone-Classification Loss ( $\mathcal{L}_{tone}$ )

For anchor $x_i$:
- Positives $P_i^+$ : same word, same tone.
- Hard negatives $H_i$ : same word, different tone.
- Soft negatives $N_i$ : different word.
Similarities (temperature $\tau_t$ ):

$s_{ij} = \frac{z_i^\top z_j}{\tau_t}, \quad Z_i = \sum_{j \in P_i^+ \cup H_i \cup N_i} \exp(s_{ij})$

Repulsive InfoNCE:

$\ell_2(x_i) = -\frac{1}{|P_i^+|} \sum_{j \in P_i^+} \log \frac{\exp(s_{ij})}{Z_i}$

Tone classification head $p_\varphi(t | z) = \mathrm{softmax}(W_{cls} z + b_{cls})$ :

$\ell_3(x_i, t_i) = -\log p_\varphi(t_i|z_i)$

Combine:

$\mathcal{L}_{tone}(D) = \frac{1}{M} \sum_{i=1}^M [\ell_2(x_i) + \lambda_{cls} \ell_3(x_i, t_i)]$
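The tone objective can be sketched in the same style. The version below is a hedged illustration, not the reference implementation: it treats every other item in a batch as belonging to $P_i^+ \cup H_i \cup N_i$ (which follows from the three set definitions), uses 0-indexed tone labels instead of $1,\dots,T$, and assumes that an anchor with no positive in the batch contributes zero to $\ell_2$. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def repulsive_info_nce(i, z, words, tones, tau_t=0.07):
    """l_2 for anchor i over a batch of L2-normalized embeddings z."""
    pos_sims, all_sims = [], []
    for j in range(len(z)):
        if j == i:
            continue
        s_ij = dot(z[i], z[j]) / tau_t
        all_sims.append(s_ij)  # positives, hard negatives and soft negatives
        if words[j] == words[i] and tones[j] == tones[i]:
            pos_sims.append(s_ij)
    if not pos_sims:
        return 0.0  # assumption: skip anchors that have no positive in the batch
    log_z = log_sum_exp(all_sims)
    return -sum(s - log_z for s in pos_sims) / len(pos_sims)


def tone_cross_entropy(z_i, t_i, W_cls, b_cls):
    """l_3 = -log softmax(W_cls z + b_cls)[t_i], with 0-indexed tone t_i."""
    logits = [dot(w, z_i) + b for w, b in zip(W_cls, b_cls)]
    return log_sum_exp(logits) - logits[t_i]


def tone_loss(z, words, tones, W_cls, b_cls, tau_t=0.07, lambda_cls=1.0):
    """Batch average of l_2 + lambda_cls * l_3."""
    total = 0.0
    for i in range(len(z)):
        total += repulsive_info_nce(i, z, words, tones, tau_t)
        total += lambda_cls * tone_cross_entropy(z[i], tones[i], W_cls, b_cls)
    return total / len(z)
```

Because same-word, different-tone items sit in the denominator $Z_i$ but never in the numerator, pulling them close to the anchor raises the loss; that is the "repulsive" part of the objective.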

2.4 Stage I Objective

$\mathcal{L}_{Stage1} = \alpha \cdot \mathcal{L}_{speaker}(D) + (1-\alpha) \cdot \mathcal{L}_{tone}(D)$

Default hyperparameters: $\alpha = 0.5$, $\tau_g = \tau_t = 0.07$, $\lambda_{cls} = 1.0$, $N = 20$ negatives.
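As a small worked example of the Stage I mixture, the defaults can be collected in a config object and applied to two already-computed loss values. The class name `Stage1Config`, the function `stage1_loss`, and the input values 0.8 and 1.2 are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Config:
    alpha: float = 0.5        # weight on the speaker loss
    tau_g: float = 0.07       # temperature of the cross-gender loss
    tau_t: float = 0.07       # temperature of the tone loss
    lambda_cls: float = 1.0   # weight on the tone classification term
    num_negatives: int = 20   # N negatives per anchor


def stage1_loss(l_speaker, l_tone, cfg=Stage1Config()):
    """alpha * L_speaker + (1 - alpha) * L_tone."""
    return cfg.alpha * l_speaker + (1.0 - cfg.alpha) * l_tone


# Made-up loss values, only to show the arithmetic: 0.5 * 0.8 + 0.5 * 1.2.
combined = stage1_loss(l_speaker=0.8, l_tone=1.2)
```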

3. ASR Fine-Tuning: CTC and Knowledge Distillation

3.1 CTC Loss

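With blocks $1,\dots,L$ frozen, adapt blocks $L+1,\dots,24$ and the CTC head $W_{ctc}$. For an input sequence $x$ and transcription $y$:

CTC likelihood, summed over all frame-level alignments $\pi$ that collapse to $y$ (in this section $B(\cdot)$ is the CTC collapsing map that merges repeated labels and removes blanks, and $T$ is the number of output frames; they are not the frozen-block count $B$ or the tone count $T$ used earlier):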

$p_\theta(y|x) = \sum_{\pi: B(\pi)=y} \prod_{t=1}^T p_\theta(\pi_t|x)$

Empirical CTC loss:

$\mathcal{L}_{CTC}(D_{asr}) = -\frac{1}{|D_{asr}|} \sum_{(x_i,y_i)} \log p_\theta(y_i|x_i)$
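The sum over alignments is exponential in the number of frames if taken literally; in practice it is computed with the standard CTC forward recursion. The sketch below shows that recursion next to a brute-force enumeration of the formula above, so the two can be compared on a toy input. The function names, the choice of label 0 as blank, and the toy probabilities are assumptions for illustration.

```python
import math
from itertools import product


def ctc_likelihood(probs, target, blank=0):
    """p(y|x) via the forward recursion; probs[t][k] = p(pi_t = k | x)."""
    ext = [blank]
    for label in target:
        ext += [label, blank]  # interleave blanks: _ y1 _ y2 _ ...
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def collapse(path, blank=0):
    """The CTC map: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def ctc_likelihood_brute_force(probs, target, blank=0):
    """Literal sum over all alignments that collapse to the target."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


def ctc_loss(dataset, blank=0):
    """Mean negative log-likelihood over (probs, target) pairs."""
    return -sum(math.log(ctc_likelihood(p, y, blank)) for p, y in dataset) / len(dataset)


# Toy posteriors: 3 frames, vocabulary {blank, 1, 2}.
toy_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
```

The brute-force version is only usable for a handful of frames; it is included to make the connection between the recursion and the formula explicit.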

3.2 Knowledge Distillation

With a frozen teacher (parameters $\varphi$) and temperature $\tau_{kd}$, the teacher logits $o_t'$ and student logits $o_t$ at frame $t$ are softened as:

$\tilde p_\varphi(\pi_t|x) = \mathrm{softmax}(o_t'/\tau_{kd}), \quad \tilde p_\theta(\pi_t|x) = \mathrm{softmax}(o_t/\tau_{kd})$

Per-frame KL loss:

$\mathcal{L}_{KD}(D_{asr}) = \frac{1}{|D_{asr}|} \sum_{x_i} \sum_{t=1}^T \mathrm{KL}[\tilde p_\varphi(\pi_t|x_i) \,\|\, \tilde p_\theta(\pi_t|x_i)]$
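The distillation term can be sketched as a sum of per-frame KL divergences between temperature-softened teacher and student distributions. This is a plain-Python illustration under the assumption that teacher and student emit logits over the same vocabulary and the same number of frames; the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, tau):
    """softmax(o / tau), computed with max subtraction for stability."""
    scaled = [o / tau for o in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL[p || q] for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss_utterance(teacher_logits, student_logits, tau_kd=3.0):
    """Sum over frames of KL[teacher_t || student_t] for one utterance."""
    return sum(
        kl_divergence(
            softmax_with_temperature(o_teacher, tau_kd),
            softmax_with_temperature(o_student, tau_kd),
        )
        for o_teacher, o_student in zip(teacher_logits, student_logits)
    )


def kd_loss(pairs, tau_kd=3.0):
    """Average of the per-utterance sums over (teacher_logits, student_logits) pairs."""
    return sum(kd_loss_utterance(t, s, tau_kd) for t, s in pairs) / len(pairs)


# Two frames, three output symbols; the values are made up.
teacher = [[2.0, 0.5, -1.0], [0.0, 3.0, 0.0]]
student = [[1.0, 1.0, -1.0], [0.5, 2.0, 0.0]]
```

A higher $\tau_{kd}$ flattens both distributions, so the student is also pushed to match the teacher's ranking of unlikely symbols rather than only its top choice.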

3.3 Stage II Objective

$\mathcal{L}_{Stage2} = \delta \cdot \mathcal{L}_{CTC} + (1-\delta)\cdot \mathcal{L}_{KD}$

Defaults: $\delta=0.7$ and $\tau_{kd}=3.0$ when KD is used; $\delta=1.0$ (pure CTC) otherwise.
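The optional-teacher behaviour can be made explicit in a few lines. In this sketch, passing no KD loss is treated as the $\delta = 1.0$ case; the function name `stage2_loss` and the example values are assumptions.

```python
def stage2_loss(l_ctc, l_kd=None, delta=0.7):
    """delta * L_CTC + (1 - delta) * L_KD, falling back to pure CTC without a teacher."""
    if l_kd is None:
        return l_ctc  # equivalent to delta = 1.0
    return delta * l_ctc + (1.0 - delta) * l_kd


with_teacher = stage2_loss(l_ctc=2.0, l_kd=1.0)   # 0.7 * 2.0 + 0.3 * 1.0
without_teacher = stage2_loss(l_ctc=2.0)
```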

4. Training Data, Preprocessing, and Optimization

4.1 Corpora

Hmong Word-Level Corpus (WRT)
- 8,570 tokens from 8 speakers (3 female, 5 male), 1,143 unique words, 163 base-words $\times$ 7 tones.
- Voice-converted augmentation adds 3,600 cross-gender tokens.
- Word-level VAD segmentation; “unseen speaker” splits hold out one male speaker as the query.

Mandarin ("Tone Perfect")
- 9,840 monosyllabic tokens, 6 speakers, 410 syllables $\times$ 4 tones.
- One male speaker held out for test.

4.2 Preprocessing and Augmentation

Resample to 16 kHz, peak normalization.
On-the-fly perturbation: additive noise, time-stretch, gain jitter.
No pitch shift for tone-sensitive tasks.
FreeVC voice conversion applied offline for cross-gender variants.
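Three of the steps above (peak normalization, gain jitter, additive noise) are simple enough to sketch without an audio library. The ranges `max_db=6.0` and `snr_db=20.0` are placeholder assumptions, since the text does not give perturbation strengths; resampling, time-stretching and FreeVC conversion need dedicated tools and are not shown.

```python
import math
import random


def peak_normalize(wave, peak=1.0):
    """Scale the waveform so that its largest absolute sample equals `peak`."""
    current = max((abs(v) for v in wave), default=0.0)
    if current == 0.0:
        return list(wave)  # silence stays silence
    return [v * peak / current for v in wave]


def gain_jitter(wave, rng, max_db=6.0):
    """Apply one random gain in [-max_db, +max_db] dB (range is an assumption)."""
    gain = 10.0 ** (rng.uniform(-max_db, max_db) / 20.0)
    return [v * gain for v in wave]


def add_noise(wave, rng, snr_db=20.0):
    """Add white Gaussian noise at a target SNR in dB (value is an assumption)."""
    signal_power = sum(v * v for v in wave) / len(wave)
    sigma = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, sigma) for v in wave]


rng = random.Random(0)  # seeded so the perturbation is reproducible
clean = peak_normalize([0.1, -0.5, 0.25, 0.0])
perturbed = add_noise(gain_jitter(clean, rng), rng)
```

Note that none of these operations changes the fundamental frequency contour, which is consistent with the stated choice to avoid pitch shifting for tone-sensitive tasks.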

4.3 Optimization

Adam optimizer, learning rate $5{\times}10^{-4}$, weight decay $10^{-2}$, gradient clipping at 1.0.
Effective batch size $16$ (4 segments $\times$ 4-step accumulation).
1,200-step warmup, 12,000 updates per stage.
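The schedule numbers above can be tied together in a short sketch. The text gives the warmup length and the update count but not the shape of the schedule, so the linear warmup followed by a constant rate used here is an assumption, as is interpreting "gradient clipping at 1.0" as clipping by global $\ell_2$ norm. All names are hypothetical.

```python
import math

BASE_LR = 5e-4
WARMUP_STEPS = 1_200
UPDATES_PER_STAGE = 12_000
MICRO_BATCH = 4        # segments per forward/backward pass
ACCUM_STEPS = 4        # gradient accumulation steps per optimizer update
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS  # 16


def lr_at(update):
    """Learning rate for a 0-indexed optimizer update within one stage."""
    if not 0 <= update < UPDATES_PER_STAGE:
        raise ValueError("update index outside the stage")
    if update < WARMUP_STEPS:
        return BASE_LR * (update + 1) / WARMUP_STEPS  # assumed linear warmup
    return BASE_LR  # assumed constant after warmup


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


clipped = clip_by_global_norm([3.0, 4.0])  # norm 5 is scaled down to norm 1
```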

5. Evaluation Protocols and Results

Metric (Hmong)	SITA-R1	XLS-R ASR Teacher
Top-1 Retrieval (F→M / M→F)	0.629 / 0.593	0.329 / 0.461
Top-5 Retrieval (F→M / M→F)	0.929 / 0.889	0.671 / 0.779
Unseen M→F Top-1 / Top-5	0.687 / 0.908	0.545 / 0.873
Tone Geometry: PosSim	0.80	≈0.99
Tone Geometry: Hard NegDist	0.675	≈0.01
Tone Geometry: Soft NegDist	0.940	—
CER / WER	0.1985 / 0.5115	0.1835 / 0.4610

Mandarin transfer: Top-1 retrieval (M→F) is 0.993 for SITA versus 0.962 for ASR-XLS-R; CER/WER with KD are 0.0073/0.0280.
On Hmong, SITA-R1 substantially improves cross-gender retrieval (Top-1 F→M 0.629 vs. 0.329) and separates same-word, different-tone embeddings (Hard NegDist 0.675 vs. ≈0.01), at a modest cost in ASR accuracy relative to the teacher (CER 0.1985 vs. 0.1835; WER 0.5115 vs. 0.4610). The approach transfers to Mandarin with no hyperparameter tuning (Xu et al., 14 Jan 2026).
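For readers who want to reproduce metrics of this kind, the sketch below shows one common way to compute Top-k cross-gender retrieval and a character error rate. The exact evaluation protocol is not spelled out in the text, so the definitions here are assumptions: a query counts as a hit if any of its k nearest gallery items (by dot product) carries the same word label, and CER is total edit distance divided by total reference length.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_accuracy(queries, gallery, k):
    """queries and gallery are lists of (embedding, word) pairs."""
    hits = 0
    for z_q, word_q in queries:
        ranked = sorted(gallery, key=lambda item: dot(z_q, item[0]), reverse=True)
        if any(word == word_q for _, word in ranked[:k]):
            hits += 1
    return hits / len(queries)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def character_error_rate(refs, hyps):
    """Corpus-level CER: summed edit distance over summed reference length."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


# Toy F->M setup: female queries, male gallery, two words.
female_queries = [([1.0, 0.0], "word_a"), ([0.0, 1.0], "word_b")]
male_gallery = [([0.9, 0.1], "word_a"), ([0.1, 0.9], "word_b")]
top1 = top_k_accuracy(female_queries, male_gallery, k=1)
```

WER follows from the same `edit_distance` function applied to lists of words instead of strings.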

6. Practical Significance and Generality

SITA-R1 demonstrates that sequential multi-objective adaptation of multilingual encoders can generate representations robust to speaker and sensitive to tone, crucial for tonal language ASR under severe data constraints. The staged design, which first achieves the desired geometry and then fits ASR decoding, prevents the collapse of tone-sensitive representations typical in naïve fine-tuning. The method's generalization across Hmong and Mandarin, with identical settings, supports applicability to other low-resource tonal languages.

7. Relation to SITA and Other "R1" Pipelines

No "SITA-R1" variant is defined in the context of adversarial attacks for stylized image generation (Kang et al., 25 Mar 2025). In the SITA speech-adaptation context, SITA-R1 exclusively refers to the two-stage speaker-invariant, tone-aware pipeline (Xu et al., 14 Jan 2026). Other ablation variants discussed for SITA in image-space (e.g., alternative destylization losses) are not denoted “R1.” Thus, in this research area, SITA-R1 is unambiguously associated with low-resource tonal speech adaptation, not adversarial image protection or vision-language grounding.

References

Xu et al. (14 Jan 2026). SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages.

Kang et al. (25 Mar 2025). SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation.
