Canary-1B-v2: Efficient Multilingual ASR & AST
- Canary-1B-v2 is a multilingual speech-to-text and translation model with an encoder-decoder structure and FastConformer backbone, optimized for high accuracy and speed in ASR and AST tasks.
- It employs a two-stage pre-training and fine-tuning process on 1.7M hours of audio data, using dynamic data balancing and cosine scheduling to enhance multilingual performance.
- Benchmark results show an average WER of 8.1% in ASR with competitive AST performance, achieving up to 10× faster inference compared to larger models.
Canary-1B-v2 is a multilingual speech-to-text and translation model designed for efficient, robust, and high-performance automatic speech recognition (ASR) and speech-to-text translation (AST) across 25 primarily European languages. Developed on an encoder–decoder architecture with a FastConformer backbone and trained on over 1.7 million hours of audio, Canary-1B-v2 combines large-scale weakly supervised pre-training, targeted fine-tuning, and advanced timestamping methods to deliver state-of-the-art accuracy and throughput in both ASR and AST tasks, competitive with much larger models.
1. Architectural Design
Canary-1B-v2 utilizes an encoder–decoder framework tailored to both ASR and AST:
- Encoder: The default architecture employs FastConformer, a variant of Conformer optimized for speech with:
- 8× subsampling via convolutional blocks, shortening sequences and reducing computation.
- Depthwise separable convolutions and lightweight convolution modules that keep the operation count low.
- The model also explores an alternative encoder based on normalized GPT (nGPT), which applies hyperspherical normalization and uses positional encoding strategies such as Rotary Positional Embeddings (RoPE) or modified ALiBi for better generalization over long sequences (a minimal RoPE sketch follows below).
- Decoder: A standard autoregressive Transformer decoder models cross-lingual and contextual dependencies for robust text generation.
- Tokenization: A shared BPE tokenizer, trained on the model’s full multilingual dataset, supports code-switching and linguistic diversity within a unified lexical space.
This architecture enables rapid inference (RTFx ~749), strong performance on large-scale speech data, and flexibility across multiple languages and tasks.
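The Rotary Positional Embeddings used in the nGPT-style encoder variant can be made concrete with a short sketch. Below is a minimal NumPy implementation of RoPE in the common "split-half" formulation; it illustrates only the rotation mechanism and is not taken from the model's code.

```python
import numpy as np

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply Rotary Positional Embeddings (RoPE) to a (seq_len, dim) array.
    Channel pairs (i, i + dim/2) are rotated by position-dependent angles,
    so relative offsets are encoded directly in query-key dot products."""
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE expects an even feature dimension"
    half = dim // 2
    # Per-pair inverse frequencies, as in the standard RoPE formulation.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: rotate a random 4-frame, 8-dimensional query matrix.
q = np.random.randn(4, 8)
print(rotary_embed(q).shape)  # (4, 8)
```

In practice the rotation is applied to queries and keys inside attention, so attention scores depend on relative rather than absolute positions.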
2. Training Procedure and Data Balancing
Canary-1B-v2 employs a two-stage pre-training and fine-tuning regime:
- Stage 1 Pre-training: The model learns core speech-text representations from approximately 1.7M hours of data, including:
- 360,000 hours of X→En translation pairs.
- 285,000 hours of English ASR data.
- 36,000 hours of non-speech audio with empty-string targets for hallucination reduction.
- Aggressive data bucketing, optimizer advancements (including techniques akin to OOMptimizer for GPU utilization), and initial per-language and per-corpus weighting for data balancing.
- Stage 2 Pre-training: Further training on the comprehensive data blend for 100,000 additional steps improves cross-task transfer.
- Fine-tuning:
- Focuses on high-quality data: NeMo ASR Set 3.0, curated Granary data, and supplementary En→X translation resources.
- Dynamic Data Balancing: Per-language corpus sampling weights are adjusted throughout fine-tuning, with an overall upsampling factor applied across languages.
- Cosine scheduling drives the per-language weights toward a uniform distribution as fine-tuning proceeds, allowing adaptation to high-quality, balanced datasets (a minimal sketch of this schedule follows below).
This staged protocol enables rapid initial convergence, maximizes coverage, and systematically refines both monolingual and multilingual generation accuracy.
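As an illustration of the dynamic balancing described above, the sketch below interpolates per-language sampling weights toward a uniform distribution with a cosine schedule. The language set, initial weights, and step counts are placeholders, not the paper's exact recipe.

```python
import math

def language_weights(initial_weights: dict, step: int, total_steps: int) -> dict:
    """Blend per-language sampling weights from their initial values toward
    a uniform distribution as fine-tuning proceeds (cosine schedule)."""
    n = len(initial_weights)
    uniform = 1.0 / n
    # Decays from 1 (keep initial weights) at step 0 to 0 (fully uniform) at the end.
    decay = 0.5 * (1.0 + math.cos(math.pi * min(step, total_steps) / total_steps))
    mixed = {lang: decay * w + (1.0 - decay) * uniform
             for lang, w in initial_weights.items()}
    total = sum(mixed.values())
    return {lang: w / total for lang, w in mixed.items()}  # renormalize

# Example with three hypothetical languages and skewed starting weights.
w0 = {"en": 0.6, "de": 0.3, "uk": 0.1}
print(language_weights(w0, step=0, total_steps=100))    # ≈ initial weights
print(language_weights(w0, step=100, total_steps=100))  # ≈ uniform (1/3 each)
```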
3. Performance and Benchmarking
Canary-1B-v2 delivers state-of-the-art results in large-scale multilingual ASR and AST:
| Model | Task | Word Error Rate (WER) | Throughput (RTFx) | Languages |
|---|---|---|---|---|
| Canary-1B-v2 | ASR | 8.1% (avg, multilingual) | ~749 | 25 (primarily European) |
| Whisper-large-v3 | ASR | 9.9% | ~75–100 | 99+ |
| Canary-1B-v2 | AST | Competitive COMET vs SeamlessM4T | See paper | 25 pairs |
- On English ASR benchmarks such as the Hugging Face Open ASR Leaderboard and on multilingual sets (FLEURS, CoVoST, MLS), Canary-1B-v2 surpasses Whisper-large-v3 in accuracy while also running approximately 10× faster at inference.
- For AST, COMET scores outperform lighter models (e.g., SeamlessM4T-medium) and remain competitive with significantly larger models (e.g., SeamlessM4T-v2-large), despite the lower parameter count.
- Per-language evaluations show that the model sustains low WER and high semantic fidelity across all target languages.
This indicates that architectural efficiency and strategic training can deliver industry-leading accuracy and throughput even without extreme model scaling.
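For reference, the WER figures above are word-level edit distances between hypothesis and reference divided by the number of reference words. A minimal sketch follows; actual evaluation pipelines typically also apply text normalization before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```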
4. Timestamping and Alignment
Reliable token-, word-, and segment-level timestamps are central for downstream tasks (subtitling, data curation, indexing):
- Canary-1B-v2 integrates the NeMo Forced Aligner (NFA) and an auxiliary Parakeet CTC model (600M parameters) for robust forced alignment.
- Unlike attention-matrix Dynamic Time Warping, which can produce non-monotonic and fuzzy alignments, CTC-based forced alignment uses Viterbi decoding to align transcription tokens with time frames, yielding clear, "hard" segmentation (a simplified sketch follows below).
- This is especially pertinent for AST, where segment-level (rather than word-level) timestamps are more robust due to differences in source/target word order.
This forced alignment strategy enables consistent, accurate time annotation, essential for high-value ASR/AST integration in production settings.
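The sketch below captures the core idea of monotonic Viterbi forced alignment over frame-level token log-probabilities. It is a simplified stand-in for the NFA/Parakeet-CTC pipeline: CTC blank symbols are omitted for brevity, and the token ids and probabilities are placeholders.

```python
import numpy as np

def viterbi_align(log_probs: np.ndarray, tokens: list) -> list:
    """Monotonic forced alignment: at each frame the path either stays on the
    current token or advances to the next; Viterbi recovers the best path.

    log_probs: (T, V) frame-level log-probabilities from an acoustic model.
    tokens:    ids of the known transcript, length N <= T.
    Returns the start frame of each token."""
    T, N = log_probs.shape[0], len(tokens)
    NEG = -1e30
    dp = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=int)
    dp[0, 0] = log_probs[0, tokens[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            advance = dp[t - 1, n - 1] if n > 0 else NEG
            best, prev = (advance, n - 1) if advance > stay else (stay, n)
            dp[t, n] = best + log_probs[t, tokens[n]]
            back[t, n] = prev
    # Backtrace from the final token at the final frame.
    path, n = [0] * T, N - 1
    for t in range(T - 1, -1, -1):
        path[t] = n
        n = back[t, n]
    starts = [0] * N
    for t in range(T - 1, -1, -1):   # first frame where each token is active
        starts[path[t]] = t
    return starts

# Toy example: 6 frames, vocabulary of 3 symbols, transcript [0, 2, 1].
rng = np.random.default_rng(0)
lp = np.log(rng.dirichlet(np.ones(3), size=6))
print(viterbi_align(lp, [0, 2, 1]))  # start frame of each token
```

Frame indices convert to seconds by multiplying by the encoder frame duration (for instance, roughly 80 ms per frame if 10 ms input features are subsampled 8×, stated here as an illustrative assumption).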
5. Comparison with Parakeet-TDT-0.6B-v3 and Other Systems
Parakeet-TDT-0.6B-v3 is introduced as a highly parameter-efficient ASR-only model:
| Model | Parameters | Scope | WER (avg, multilingual) | Throughput (RTFx) |
|---|---|---|---|---|
| Canary-1B-v2 | ~1B | ASR, AST | 8.1% | ~749 |
| Parakeet-TDT-0.6B-v3 | 600M | ASR only | Slightly higher than Canary-1B-v2 | >3300 |
| LLM-based ASR systems | 7B+ | Flexible | Higher WER | <62 |
- Parakeet-TDT-0.6B-v3 is trained solely on 660,000 hours of ASR data, supporting the same 25 languages with a lower computational footprint and very competitive performance.
- It achieves an RTFx exceeding 3300 (over 54× faster than some LLM-based baselines) with only modest degradation in WER compared to Canary-1B-v2 (a measurement sketch for RTFx follows below).
- This highlights that with focused data curation and dynamic balancing, high-quality ASR does not require scaling to extreme parameter counts.
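RTFx (inverse real-time factor) in the tables above is the amount of audio processed per unit of wall-clock time, so RTFx ≈ 749 means roughly 749 seconds of audio transcribed per second of compute. A minimal measurement sketch, where `transcribe_fn` is a placeholder for any batch transcription call:

```python
import time

def measure_rtfx(transcribe_fn, audio_files: list, durations_sec: list) -> float:
    """Inverse real-time factor: total audio seconds / wall-clock seconds."""
    start = time.perf_counter()
    transcribe_fn(audio_files)            # run the model on the whole batch
    elapsed = time.perf_counter() - start
    return sum(durations_sec) / elapsed
```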
6. Techniques for Hallucination Reduction
To address common failure modes in ASR and AST such as hallucination:
- Non-speech audio (36,000 hours with empty-string targets) is included in pre-training to expose the model to negative examples, reducing spurious transcriptions in both ASR and AST (illustrated by the manifest sketch below).
- The presence of weakly supervised, large-scale audio-text data in training encourages greater generalization and minimizes overfitting to spurious patterns.
This suggests effective suppression of hallucinations is possible through large-scale negative sampling and careful curriculum selection rather than extensive architectural overhauls.
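To illustrate the negative-example recipe, the snippet below writes NeMo-style JSON-lines manifest entries in which a non-speech clip carries an empty `text` target. File names and durations are hypothetical, and actual Canary training manifests may carry additional task-specific fields.

```python
import json

# Hypothetical manifest entries in NeMo's JSON-lines format.
entries = [
    {"audio_filepath": "clips/speech_0001.wav", "duration": 6.2,
     "text": "hello and welcome to the show"},
    {"audio_filepath": "clips/noise_0001.wav", "duration": 5.0,
     "text": ""},  # non-speech audio -> empty-string target
]
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```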
7. Significance and Application Scope
- Canary-1B-v2 demonstrates that architectural efficiency, large-scale multilingual training, dynamic balancing, and robust forced alignment collectively enable state-of-the-art performance while remaining deployable in resource-constrained scenarios.
- The approach is adaptable to both broad-coverage multilingual ASR and speech translation, and with minor modifications, can be extended to additional modalities or languages.
This suggests a paradigm where models with moderate parameter counts, when paired with sophisticated training protocols and modular alignment tools, can match or surpass the output quality of much larger systems across practical, real-world tasks.
For further technical details and experimental results, see "Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST" (Sekoyan et al., 17 Sep 2025).