
SSA-HuBERT-Large: African Speech Model

Updated 5 December 2025
  • The paper introduces SSA-HuBERT-Large, a 317M self-supervised speech model that achieves a 7.3 point WER reduction on ASR tasks compared to its Base variant.
  • SSA-HuBERT-Large employs a 24-layer transformer encoder with 16 self-attention heads and integrates residual adapters for efficient accent adaptation across 18 Sub-Saharan languages.
  • Empirical evaluations demonstrate robust performance on both ASR and LID tasks, highlighting scalable adapter-based accent adaptation with minimal additional computational cost.

SSA-HuBERT-Large is a 317M parameter self-supervised speech representation model specifically developed for African languages, extending the Hidden-unit BERT (HuBERT) paradigm with both computational and data modifications to meet the unique challenges of Sub-Saharan speech technology. It is characterized by large-scale, Africa-centric pre-training, open access to model weights, and rigorous evaluation on low-resource automatic speech recognition (ASR) and language identification (LID) tasks. Notably, the SSA-HuBERT-Large model family also encompasses variants with residual adapters for accent adaptation, as well as resource-optimized implementations designed for academic compute environments.

1. Model Architecture

SSA-HuBERT-Large employs a transformer encoder with the following key properties (Caubrière et al., 28 Nov 2025):

  • Transformer Encoder:
    • 24 layers, each with hidden-state dimensionality of 1024.
    • 16 self-attention heads per layer.
  • Downstream Heads:
    • For ASR: Two 1024-unit ReLU-activated feed-forward layers, followed by a linear softmax output over the vocabulary.
    • For LID: Mean-pooled encoder outputs, optionally passed through a 1024-unit projection before softmax.
  • Parameter Counts:
    • SSA-HuBERT-Base: 95M
    • SSA-HuBERT-Large: 317M
    • SSA-HuBERT-XL: 964M

Accent-adaptive variants integrate lightweight residual adapters after each transformer block, formalized as bottleneck feed-forward subnetworks that undergo accent-specific self-supervised tuning while the base model parameters remain frozen (Bhatia et al., 2023). Each adapter module (e.g., with bottleneck dimension 1024) adds approximately 16% more parameters per accent, as sketched below.
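A minimal PyTorch sketch of such a residual adapter follows; the class name, the pre-adapter layer norm, and the default dimensions are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck feed-forward adapter inserted after a transformer block.

    Hypothetical sketch: layer norm, down-projection, non-linearity,
    up-projection, and a residual connection back to the block output.
    """

    def __init__(self, hidden_dim: int = 1024, bottleneck_dim: int = 1024):
        super().__init__()
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.activation = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, time, hidden_dim) output of one transformer block
        residual = hidden_states
        x = self.layer_norm(hidden_states)
        x = self.up(self.activation(self.down(x)))
        return residual + x
```

With hidden and bottleneck dimensions both set to 1024, the two projections contribute roughly 2 × 1024 × 1024 ≈ 2.1M weights per layer, or about 50M over 24 layers, which is consistent with the reported ~16% per-accent overhead relative to the 317M-parameter base model.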

2. Pre-training Data and Objectives

SSA-HuBERT-Large is trained exclusively on ~60,000 hours of unlabelled speech from 18 Sub-Saharan languages. The language composition includes Swahili, Hausa, Kinyarwanda, African-accented French, Bambara, Lingala, Sango, Tamasheq, Maninkakan, Songhai, Fula, Luba-Lulua, Kituba, Zarma, Wolof, Dyula, Mossi, and Gulmancema. Data sourcing and pre-processing follow the procedures detailed in prior Africa-centric self-supervised pretraining works (Caubrière et al., 28 Nov 2025).

Pre-training uses the standard HuBERT masked prediction loss:

L = \mathbb{E}_{x \sim D}\left[ -\sum_{t \in T_{\text{mask}}} \log p\!\left(\hat{h}_t \mid x_{/T_{\text{mask}}}\right) \right]

where $T_{\text{mask}}$ denotes the masked time-steps, $\hat{h}_t$ is the discrete target obtained from clustering, and $x_{/T_{\text{mask}}}$ is the unmasked portion of the input. No modifications to the core objective are introduced beyond masking and HuBERT’s cluster-based targets.
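As a concrete illustration, a minimal PyTorch sketch of this masked-prediction cross-entropy is given below, assuming precomputed cluster targets and a boolean frame mask; tensor names and shapes are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def hubert_masked_prediction_loss(
    logits: torch.Tensor,           # (batch, time, num_clusters) encoder predictions
    cluster_targets: torch.Tensor,  # (batch, time) discrete cluster ids from k-means
    mask: torch.Tensor,             # (batch, time) True where the input frame was masked
) -> torch.Tensor:
    """Cross-entropy over masked frames only, as in the HuBERT objective."""
    masked_logits = logits[mask]            # (num_masked, num_clusters)
    masked_targets = cluster_targets[mask]  # (num_masked,)
    return F.cross_entropy(masked_logits, masked_targets)
```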

3. Training Protocols and Optimization

  • Batch size per GPU: 56.25 s of audio
  • Optimizer: Adam (β₁=0.9, β₂=0.98), weight decay=0.01
  • Learning-rate schedule: Linear warmup over the first 32,000 steps, then linear decay to zero over 450,000 total updates (a sketch of this schedule follows the list).
  • Hardware: 4×NVIDIA H100 GPUs, with gradient accumulation over 32 steps
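A minimal PyTorch sketch of this optimizer and schedule follows; the peak learning rate and the placeholder model are assumptions, since the peak value is not restated here.

```python
import torch

WARMUP_STEPS = 32_000
TOTAL_STEPS = 450_000

def lr_factor(step: int) -> float:
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

model = torch.nn.Linear(1024, 1024)  # placeholder for the SSA-HuBERT encoder
optimizer = torch.optim.Adam(
    model.parameters(), lr=5e-4,      # peak LR is a placeholder assumption
    betas=(0.9, 0.98), weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```

Gradient accumulation over 32 micro-batches would simply defer `optimizer.step()` and `scheduler.step()` to every 32nd forward/backward pass.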

SSA-HuBERT-Large pre-training adheres to computational regimes feasible for academic labs, as demonstrated by resource-optimized variants that complete large-scale pre-training within 2,880 GPU-hours (8×A100 GPUs over 15 days) (Chen et al., 2023).

4. Fine-tuning Procedures for Downstream Tasks

ASR (Connectionist Temporal Classification)

  • Toolkit: SpeechBrain
  • Architecture: Encoder from pre-trained SSA-HuBERT-Large, followed by two 1024-unit ReLU layers and a linear output layer matching the transcript vocabulary (sketched after this list).
  • Data: FLEURS-SSA, a curated 320-hour multi-language read speech corpus in 20 Sub-Saharan languages.
  • Procedure: 40 epochs joint multilingual fine-tuning with batch size 4; single A100 GPU; monolingual transfer uses the same protocol.
  • Loss: CTC on character-level targets.
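A minimal PyTorch sketch of this CTC head is given below, written against generic encoder outputs rather than as the actual SpeechBrain recipe; the class name and vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Two 1024-unit ReLU feed-forward layers followed by a linear projection
    onto the character vocabulary (index 0 reserved for the CTC blank)."""

    def __init__(self, encoder_dim: int = 1024, vocab_size: int = 100):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(encoder_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.out = nn.Linear(1024, vocab_size)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, encoder_dim) -> (batch, time, vocab_size) log-probs
        return torch.log_softmax(self.out(self.ff(encoder_out)), dim=-1)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# nn.CTCLoss expects (time, batch, vocab) log-probabilities:
# loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```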

LID

  • Head Variants:
    • “LID”: mean-pooled encoder outputs to softmax
    • “LID-smooth”: mean-pooled outputs, 1024-unit feed-forward projection, softmax (both heads are sketched after this list)
  • Training: 15 epochs, batch size 4, A100 GPU
  • Loss: Cross-entropy over 20 language targets
  • Evaluation: Same FLEURS-SSA splits as ASR
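Both head variants can be sketched in PyTorch as below; the class name and the ReLU in the “smooth” projection are assumptions, since only a 1024-unit feed-forward projection is specified.

```python
import torch
import torch.nn as nn

class LIDHead(nn.Module):
    """Mean-pool encoder outputs over time, optionally project ("LID-smooth"),
    then classify over the 20 language targets with cross-entropy."""

    def __init__(self, encoder_dim: int = 1024, num_languages: int = 20,
                 smooth: bool = True):
        super().__init__()
        self.proj = (nn.Sequential(nn.Linear(encoder_dim, 1024), nn.ReLU())
                     if smooth else nn.Identity())
        self.classifier = nn.Linear(1024 if smooth else encoder_dim, num_languages)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, encoder_dim)
        pooled = encoder_out.mean(dim=1)           # (batch, encoder_dim)
        return self.classifier(self.proj(pooled))  # logits for nn.CrossEntropyLoss
```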

Accent Adaptation

Self-supervised accent adaptation proceeds in two stages (Bhatia et al., 2023):

  • Stage 1: Generic HuBERT-Large pre-training on large-scale English speech (Libri-Light, 60K hours)
  • Stage 2: Accent-adaptive continual self-supervised learning, either:
    • Full encoder adaptation (low learning rate, updates all model parameters), or
    • Adapter-only adaptation (update adapter parameters exclusively, freeze main model), achieving most of the WER reduction at a fraction of the per-accent parameter cost.

No additional regularization beyond standard weight decay is used during adaptation.
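A minimal sketch of the adapter-only variant follows, assuming adapter modules (such as the hypothetical ResidualAdapter in Section 1) have been attached to the encoder and are identifiable by name; the naming convention and learning rate are assumptions.

```python
import torch

def configure_adapter_only_training(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Freeze all base-model parameters and optimize only the adapter weights."""
    for name, param in model.named_parameters():
        # Assumed convention: adapter parameters carry "adapter" in their names.
        param.requires_grad = "adapter" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4, weight_decay=0.01)  # LR is a placeholder
```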

5. Empirical Performance

ASR

| Model | #Params | Avg. CER | Avg. WER |
| --- | --- | --- | --- |
| SSA-HuBERT-Base-v2 | 95 M | 13.0 | 45.1 |
| SSA-HuBERT-Large | 317 M | 10.5 | 37.8 |
| SSA-HuBERT-XL | 964 M | 10.1 | 37.2 |
| AfriHuBERT-n (BASE) | 95 M | 11.5 | 40.6 |
| XLS-R128 (LARGE) | 317 M | 10.8 | 39.2 |

  • SSA-HuBERT-Large yields an absolute WER reduction of 7.3 points compared to Base (45.1 → 37.8).
  • The move to XL size offers only marginal further improvement (<1 pp).
  • Maximum gains are observed on languages present in pre-training data, underscoring the importance of coverage.

LID

| Model | LID (%) | LID-smooth (%) |
| --- | --- | --- |
| AfriHuBERT-n | 93.3 | 93.5 |
| SSA-HuBERT-Base-v2 | 78.3 | 85.5 |
| SSA-HuBERT-Large | 89.9 | 93.7 |
| SSA-HuBERT-XL | 92.7 | 93.1 |

  • SSA-HuBERT-Large matches AfriHuBERT-n on LID when the “smooth” projection head is used (93.7% vs. 93.5%).
  • Plain mean-pooling is suboptimal for LID; an added projection improves performance.
Accent Adaptation

| Accent | Baseline WER | Adapter WER | Full WER | Adapter WERR | Full WERR |
| --- | --- | --- | --- | --- | --- |
| Indian | 24.8 | 18.9 | 18.1 | 23.9% | 27.2% |
| Scottish | 52.0 | 37.2 | 37.4 | 28.5% | 27.8% |
| German | 32.0 | 26.3 | 24.7 | 17.7% | 22.8% |
| Chinese | 30.0 | 23.8 | 23.2 | 20.8% | 22.7% |

Adapter-only updates (≈16% of model parameters) deliver a mean WERR of 22.7%, closely tracking full encoder adaptation (25.1%).

6. Model Scaling, Efficiency, and Practical Recommendations

Empirical results demonstrate significant gains when scaling from Base (95M) to Large (317M) under low-resource fine-tuning conditions (∼16% relative WER reduction for ASR). Further scaling to XL (964M) produces diminishing returns, suggesting that once the Large scale is attained, greater focus should be placed on pre-training data diversity (additional languages and dialects) rather than further increases in model size (Caubrière et al., 28 Nov 2025).

Accent adaptation with residual adapters enables rapid extension to new accents using a limited amount of unlabeled speech, bypassing the need for transcription. These adapters require only a fraction of the total model parameters per accent, supporting scalable deployment across accent or domain boundaries (Bhatia et al., 2023). The methodology generalizes to other speech variations (children’s, pathological, environmental).

Large-scale pretraining with only academic compute resources is practical, as demonstrated by a HuBERT-Large model trained using 8 GPUs in 2,880 GPU-hours, attaining equivalent or superior performance to industrial-scale pretraining (Chen et al., 2023).

7. Broader Impact and Future Directions

SSA-HuBERT-Large delivers the first strong, open-weight encoder foundation model family trained solely on African speech data, enabling both research and deployed applications in under-represented language technology settings (Caubrière et al., 28 Nov 2025). Beyond ASR and LID, plausible extensions include spoken language understanding, speaker diarization, and domains requiring robustness to low supervision.

For classification tasks such as LID, empirical findings favor employing a “smooth” head for maximum generalization. Strategic expansions should prioritize broader language and dialect coverage, and incorporation of more morphologically rich languages.

Adapter-based approaches provide a template for cheap, scalable transfer to new speech sub-domains; the same architecture and training protocol are applicable to other axes of variation, promising broad applicability in global speech technology research (Bhatia et al., 2023).

Open problems include maximizing cross-linguistic transfer, optimizing adapter design for even greater parameter efficiency, and extending pre-training and evaluation to richer downstream speech processing tasks.
