SITA-R1 Pipeline for Tonal ASR Adaptation
- The paper introduces the SITA-R1 Pipeline, a two-stage adaptation approach combining cross-gender contrastive and tone-aware losses to enhance XLS-R for low-resource tonal ASR.
- The methodology first freezes lower transformer blocks for robust representation learning and then fine-tunes using CTC loss with optional knowledge distillation.
- The pipeline achieves significant improvements in cross-gender retrieval and tone separation while generalizing effectively across Hmong and Mandarin tonal datasets.
SITA-R1 Pipeline refers to the two-stage adaptation framework introduced in "SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages" (Xu et al., 14 Jan 2026). The pipeline is designed to enhance wav2vec-style encoders—specifically XLS-R—by imposing speaker-invariance and tone-awareness for low-resource tonal ASR and representation learning. The approach deploys a curriculum in which a contrastive/tone-aware representation stage is followed by ASR fine-tuning with connectionist temporal classification (CTC) and optional knowledge distillation.
1. Pipeline Architecture and Overview
SITA-R1 adapts a pretrained XLS-R encoder through two sequential stages:
- Stage I: Representation Learning
- Freeze the lowest $B$ transformer blocks.
- Update blocks $B+1, \dots, L$.
- At layer $L$, extract an $\ell_2$-normalized frame-level representation.
- Multi-objective loss:
- Cross-gender contrastive loss for speaker-invariance.
- Tone-repulsive InfoNCE + tone classification for tone-awareness.
- Stage II: ASR Fine-Tuning
- Freeze blocks $1, \dots, L$ from Stage I.
- Update blocks $L+1, \dots, 24$ (top stack) and the CTC head.
- Optimize a weighted combination of CTC loss and knowledge distillation (if a teacher is available).
The block structure is as follows (for XLS-R; 24 layers):
XLS-R: blocks 1...B (frozen) → blocks B+1...L (updated) → blocks L+1...24 (frozen/updated in II)
└─────────────Stage I────────────┘ └─────Stage II─────┘
at stage boundary: freeze 1...L, update L+1...24 + CTC head
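The freeze/update schedule in the diagram can be sketched as a small helper that lists which blocks receive gradient updates in each stage. This is only an illustration: the function name `trainable_blocks` and the example boundaries `B=8`, `L=16` are hypothetical and not taken from the paper.

```python
def trainable_blocks(stage, B, L, num_blocks=24):
    """Return the 1-indexed transformer blocks that are updated in a stage."""
    if not (0 <= B < L <= num_blocks):
        raise ValueError("expected 0 <= B < L <= num_blocks")
    if stage == 1:
        # Stage I: blocks 1..B stay frozen, blocks B+1..L are updated.
        return list(range(B + 1, L + 1))
    if stage == 2:
        # Stage II: blocks 1..L stay frozen, blocks L+1..num_blocks are
        # updated together with the CTC head (the head is not listed here).
        return list(range(L + 1, num_blocks + 1))
    raise ValueError("stage must be 1 or 2")


# Hypothetical boundaries, chosen only for illustration.
stage1 = trainable_blocks(1, B=8, L=16)
stage2 = trainable_blocks(2, B=8, L=16)
```

With these example boundaries the two stages update disjoint sets of blocks, so the geometry learned in Stage I is not overwritten during ASR fine-tuning.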
2. Representation Learning: Cross-Gender Contrastive and Tone Objectives
2.1 Data and Embeddings
- Input dataset, where each item consists of:
- a speech segment waveform
- an orthographic word label
- a tone label
- the speaker's gender (male/female)
- Embedding extraction:
- Forward the waveform through XLS-R up to layer $L$.
- Apply $\ell_2$ normalization to the resulting representation.
2.2 Cross-Gender Contrastive Loss
- For each anchor, the positive is the same lexical item spoken by an opposite-gender speaker (obtained via voice conversion or direct recording); the negatives are different words.
- Compute cosine similarities between the anchor and each candidate, scaled by a temperature.
- Compute an InfoNCE loss per sample.
- Aggregate the per-sample losses over the batch.
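A minimal sketch of a per-sample InfoNCE term matching the description above, in plain Python. The function names, the toy 2-D vectors, and the temperature `tau=0.1` are assumptions for illustration; the paper's own value and exact formulation are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.1):
    """Standard InfoNCE: -log softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# Toy example: the anchor and its cross-gender positive point the same way.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the opposite-gender rendition of the same word is the nearest candidate, and grows when a different word is closer to the anchor.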
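2.3 Tone-Repulsive + Tone-Classification Loss
- For each anchor:
- Positives: same word, same tone.
- Hard negatives: same word, different tone.
- Soft negatives: different word.
- Compute similarities between the anchor and each candidate, scaled by a temperature.
- Apply a repulsive InfoNCE loss over the positives, hard negatives, and soft negatives.
- A tone classification head supplies a classification term.
- Combine the repulsive InfoNCE and tone classification terms into the tone loss.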
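One way to sketch the tone objective is a multi-positive InfoNCE whose denominator contains both hard and soft negatives, plus a cross-entropy term from the tone head. The exact form of the paper's repulsive term is not recoverable from this text, so the formulation below, the weight `alpha`, and the temperature `tau` are all assumptions. Inputs are precomputed cosine similarities and tone logits.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def tone_repulsive_nce(pos_sims, hard_sims, soft_sims, tau=0.1):
    """Assumed form: -log( sum_pos / (sum_pos + sum_hard + sum_soft) )."""
    pos = [s / tau for s in pos_sims]
    neg = [s / tau for s in hard_sims + soft_sims]
    return log_sum_exp(pos + neg) - log_sum_exp(pos)


def tone_cross_entropy(tone_logits, tone):
    """Cross-entropy of the tone head for the correct tone index."""
    return log_sum_exp(tone_logits) - tone_logits[tone]


def tone_loss(pos_sims, hard_sims, soft_sims, tone_logits, tone,
              alpha=1.0, tau=0.1):
    # alpha is a hypothetical weight on the classification term.
    nce = tone_repulsive_nce(pos_sims, hard_sims, soft_sims, tau)
    return nce + alpha * tone_cross_entropy(tone_logits, tone)


# A hard negative (same word, different tone) that sits close to the
# anchor is penalized more than one that is far away.
far_hard = tone_repulsive_nce([0.8], [0.1], [0.0])
near_hard = tone_repulsive_nce([0.8], [0.7], [0.0])
```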
2.4 Stage I Objective
Default hyperparameters: , , , negatives.
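Written out, one plausible form of the Stage I objective is a weighted sum of the two terms from Sections 2.2 and 2.3. The symbols $\mathcal{L}_{\mathrm{cg}}$, $\mathcal{L}_{\mathrm{tone}}$, and the weight $\lambda_{\mathrm{tone}}$ are notation assumed here for illustration, not the paper's confirmed notation:

```latex
\mathcal{L}_{\mathrm{I}}
  = \mathcal{L}_{\mathrm{cg}}
  + \lambda_{\mathrm{tone}}\,\mathcal{L}_{\mathrm{tone}}
```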
3. ASR Fine-Tuning: CTC and Knowledge Distillation
3.1 CTC Loss
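With blocks $1, \dots, L$ frozen, adapt blocks $L+1, \dots, 24$ and the CTC head. For an input sequence and its transcription:
- The CTC log-likelihood marginalizes over all frame-level alignments that collapse to the transcription.
- The empirical CTC loss is the negative log-likelihood aggregated over the training set.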
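The CTC log-likelihood can be computed with the standard forward recursion over the blank-extended label sequence. The sketch below is a plain-Python reference version for small inputs, not the paper's training code; the function names and the choice of index 0 for the blank are assumptions.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_log_likelihood(log_probs, target, blank=0):
    """log p(target | input); log_probs is a T x V list of log-probabilities."""
    ext = [blank]  # blank-extended target: blank, y1, blank, y2, ..., blank
    for label in target:
        ext += [label, blank]
    S = len(ext)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total = log_add(total, alpha[s - 1])
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            new[s] = total + log_probs[t][ext[s]]
        alpha = new
    result = alpha[S - 1]
    if S > 1:
        result = log_add(result, alpha[S - 2])
    return result


def ctc_loss(batch):
    """Mean negative log-likelihood over (log_probs, target) pairs."""
    return -sum(ctc_log_likelihood(lp, y) for lp, y in batch) / len(batch)


# Two frames, vocabulary {blank, a}, uniform posteriors.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
p_a = math.exp(ctc_log_likelihood(uniform, [1]))
```

In the toy case, three alignments collapse to the single label (`a a`, `a blank`, `blank a`), each with probability 0.25, so the likelihood is 0.75.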
3.2 Knowledge Distillation
With a frozen teacher and a distillation temperature, the distillation term is a per-frame KL divergence between the teacher and student output distributions.
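A minimal sketch of a temperature-scaled per-frame KL term follows. The direction KL(teacher || student), the averaging over frames, and the temperature value `2.0` are assumptions; the text above does not pin them down.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for one frame after temperature scaling."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(teacher_frames, student_frames, temperature=2.0):
    """Average the per-frame KL over all frames of an utterance."""
    per_frame = [frame_kl(t, s, temperature)
                 for t, s in zip(teacher_frames, student_frames)]
    return sum(per_frame) / len(per_frame)


teacher = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.2]]
student = [[1.0, 1.0, -0.5], [0.3, 0.9, 0.1]]
matched = kd_loss(teacher, teacher)
mismatched = kd_loss(teacher, student)
```

The term is zero when the student reproduces the teacher's frame posteriors and positive otherwise; the teacher's logits are treated as constants because the teacher is frozen.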
3.3 Stage II Objective
Default: , if KD is used; if not.
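Since Stage II optimizes a weighted combination of the CTC and distillation terms, its objective can be sketched as below. The symbol $\lambda_{\mathrm{KD}}$ is assumed notation, and whether the CTC term carries its own weight is not specified here:

```latex
\mathcal{L}_{\mathrm{II}}
  = \mathcal{L}_{\mathrm{CTC}}
  + \lambda_{\mathrm{KD}}\,\mathcal{L}_{\mathrm{KD}},
\qquad
\lambda_{\mathrm{KD}} = 0 \ \text{when no teacher is available}
```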
4. Training Data, Preprocessing, and Optimization
4.1 Corpora
- Hmong Word-Level Corpus (WRT)
- 8,570 tokens (3F/5M), 1,143 unique words, 163 base-words × 7 tones.
- Voice-converted augmentation adds 3,600 cross-gender tokens.
- Word-level VAD segmentation; “unseen speaker” splits hold out one male as query.
- Mandarin ("Tone Perfect")
- 9,840 monosyllabic tokens, 6 speakers, 410 syllables × 4 tones.
- One male held out for test.
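The held-out-speaker protocol amounts to partitioning tokens by speaker identity. The sketch below uses toy records with hypothetical field names and speaker ids; it also checks that the Mandarin token count is consistent with one token per syllable, tone, and speaker combination.

```python
def hold_out_speaker(tokens, held_out_speaker):
    """Split tokens so the held-out speaker never appears in training."""
    train = [t for t in tokens if t["speaker"] != held_out_speaker]
    test = [t for t in tokens if t["speaker"] == held_out_speaker]
    return train, test


# Toy records; speaker ids and field names are illustrative only.
tokens = [
    {"speaker": "F1", "gender": "F", "word": "a"},
    {"speaker": "M1", "gender": "M", "word": "a"},
    {"speaker": "M2", "gender": "M", "word": "a"},
    {"speaker": "M2", "gender": "M", "word": "b"},
]
train, test = hold_out_speaker(tokens, "M2")

# 410 syllables x 4 tones x 6 speakers matches the reported corpus size.
mandarin_tokens = 410 * 4 * 6
```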
4.2 Preprocessing and Augmentation
- Resample to 16 kHz, peak normalization.
- On-the-fly perturbation: additive noise, time-stretch, gain jitter.
- No pitch shift for tone-sensitive tasks.
- FreeVC voice conversion applied offline for cross-gender variants.
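Two of the steps above, peak normalization and the gain/noise perturbation, are simple enough to sketch in plain Python. Resampling, time-stretch, and voice conversion need dedicated audio tooling and are left out. The parameter values (`noise_std`, `max_gain_db`) are placeholders, not the paper's settings.

```python
import random


def peak_normalize(samples, peak=1.0):
    """Scale the waveform so its largest absolute sample equals `peak`."""
    max_abs = max(abs(s) for s in samples)
    if max_abs == 0.0:
        return list(samples)  # silent input: nothing to scale
    return [s * peak / max_abs for s in samples]


def perturb(samples, rng, noise_std=0.005, max_gain_db=3.0):
    """Apply random gain jitter and additive Gaussian noise.

    Pitch is deliberately left untouched, since pitch carries tone.
    """
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s + rng.gauss(0.0, noise_std) for s in samples]


rng = random.Random(0)
clean = peak_normalize([0.1, -0.5, 0.25])
noisy = perturb(clean, rng)
```

Drawing a fresh gain and noise realization each time a segment is sampled gives the on-the-fly behaviour described above.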
4.3 Optimization
- Adam optimizer, learning rate , weight decay , gradient clipping at 1.0.
- Effective batch size 16 (4 segments × 4-step accumulation).
- 1,200-step warmup, 12,000 updates per stage.
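The warmup and accumulation settings can be sketched as follows. The linear warmup shape, the constant rate after warmup, and the `base_lr` value are assumptions; only the 1,200 warmup steps, the 12,000 updates, and the 4 × 4 batch arithmetic come from the description above.

```python
def warmup_lr(step, base_lr, warmup_steps=1200):
    """Linear warmup to base_lr, then constant (post-warmup shape assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


segments_per_step = 4
accumulation_steps = 4
effective_batch = segments_per_step * accumulation_steps

# base_lr is a placeholder value for illustration only.
base_lr = 1e-4
schedule = [warmup_lr(step, base_lr) for step in range(12000)]
```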
5. Evaluation Protocols and Results
| Metric | Hmong SITA-R1 | XLS-R ASR Teacher |
|---|---|---|
| Top-1 Retrieval (F→M / M→F) | 0.629 / 0.593 | 0.329 / 0.461 |
| Top-5 Retrieval (F→M / M→F) | 0.929 / 0.889 | 0.671 / 0.779 |
| Unseen M→F Top-1 / Top-5 | 0.687 / 0.908 | 0.545 / 0.873 |
| Tone Geometry: PosSim | 0.80 | ≈0.99 |
| Tone Geometry: Hard NegDist | 0.675 | ≈0.01 |
| Tone Geometry: Soft NegDist | 0.940 | — |
| CER / WER | 0.1985 / 0.5115 | 0.1835 / 0.4610 |
- Mandarin transfer performance: Top-1 retrieval (M→F): SITA 0.993 vs. ASR-XLS-R 0.962; CER/WER with KD: 0.0073/0.0280.
- SITA-R1 yields large improvements in cross-gender retrieval and in the separation between same-word, different-tone embeddings, at a modest ASR cost on Hmong (CER 0.1985 vs. 0.1835; WER 0.5115 vs. 0.4610 for the teacher). The approach transfers to Mandarin with no hyperparameter tuning (Xu et al., 14 Jan 2026).
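The Top-k retrieval metric can be sketched as a nearest-neighbour search by cosine similarity: each query embedding from one gender is matched against a gallery from the other, and a hit is counted when the correct word appears among the k closest items. The function names and the toy 2-D embeddings are illustrative, not the paper's evaluation code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_accuracy(queries, gallery, k=1):
    """queries and gallery are lists of (word_label, embedding) pairs."""
    hits = 0
    for label, query in queries:
        ranked = sorted(gallery,
                        key=lambda item: cosine(query, item[1]),
                        reverse=True)
        if any(g_label == label for g_label, _ in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy F->M setup: female queries searched against a male gallery.
gallery = [("a", [1.0, 0.1]), ("b", [0.1, 1.0]), ("c", [0.7, 0.7])]
queries = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 0.2])]
top1 = top_k_accuracy(queries, gallery, k=1)
top2 = top_k_accuracy(queries, gallery, k=2)
```

In this toy case the third query lands nearest to the wrong word, so it misses at k=1 but is recovered at k=2, which mirrors the gap between the Top-1 and Top-5 rows of the table.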
6. Practical Significance and Generality
SITA-R1 demonstrates that sequential multi-objective adaptation of multilingual encoders can generate representations robust to speaker and sensitive to tone, crucial for tonal language ASR under severe data constraints. The staged design, which first achieves the desired geometry and then fits ASR decoding, prevents the collapse of tone-sensitive representations typical in naïve fine-tuning. The method's generalization across Hmong and Mandarin, with identical settings, supports applicability to other low-resource tonal languages.
7. Relation to SITA and Other "R1" Pipelines
No "SITA-R1" variant is defined in the context of adversarial attacks for stylized image generation (Kang et al., 25 Mar 2025). In the SITA speech-adaptation context, SITA-R1 exclusively refers to the two-stage speaker-invariant, tone-aware pipeline (Xu et al., 14 Jan 2026). Other ablation variants discussed for SITA in image-space (e.g., alternative destylization losses) are not denoted “R1.” Thus, in this research area, SITA-R1 is unambiguously associated with low-resource tonal speech adaptation, not adversarial image protection or vision-language grounding.