Canary-1B-v2: Efficient Multilingual ASR & AST

Updated 18 September 2025
  • Canary-1B-v2 is a multilingual speech-to-text and translation model with an encoder-decoder structure and FastConformer backbone, optimized for high accuracy and speed in ASR and AST tasks.
  • It employs a two-stage pre-training and fine-tuning process on 1.7M hours of audio data, using dynamic data balancing and cosine scheduling to enhance multilingual performance.
  • Benchmark results show an average multilingual WER of 8.1% for ASR and competitive AST performance, with inference up to 10× faster than larger models.

Canary-1B-v2 is a multilingual speech-to-text and translation model designed for efficient, robust, and high-performance automatic speech recognition (ASR) and speech-to-text translation (AST) across 25 primarily European languages. Developed on an encoder–decoder architecture with a FastConformer backbone and trained on over 1.7 million hours of audio, Canary-1B-v2 combines large-scale weakly supervised pre-training, targeted fine-tuning, and advanced timestamping methods to deliver state-of-the-art accuracy and throughput in both ASR and AST tasks, competitive with much larger models.
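
As a quick orientation, the sketch below shows how such a model would typically be loaded and run through NVIDIA NeMo's multitask speech interface. The checkpoint identifier and the exact transcribe() options are assumptions rather than details from the paper; the released model card and NeMo documentation are authoritative.

```python
# Minimal usage sketch, assuming the checkpoint is exposed through NeMo's
# multitask (ASR + AST) model class; argument names beyond the audio list
# may differ between NeMo releases.
from nemo.collections.asr.models import EncDecMultiTaskModel

# Hypothetical checkpoint name -- verify against the published model card.
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")

# Plain transcription of a local file. Task selection (ASR vs. AST) and
# source/target languages are controlled via additional transcribe() options
# documented in NeMo.
hypotheses = model.transcribe(["sample.wav"])
print(hypotheses[0])
```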

1. Architectural Design

Canary-1B-v2 utilizes an encoder–decoder framework tailored to both ASR and AST:

  • Encoder: The default architecture employs FastConformer, a variant of Conformer optimized for speech with:
    • 8× subsampling via convolutional blocks for sequence shortening and computational gain.
    • Depthwise separable convolutions and lightweight convolution modules to reduce computational cost.
  • The model also explores an alternative encoder based on normalized GPT (nGPT), which applies hyperspherical normalization and utilizes positional encoding strategies such as Rotary Positional Embeddings (RoPE) or modified ALiBi for enhanced generalization over long sequences.
  • Decoder: A standard autoregressive Transformer decoder models cross-lingual and contextual dependencies for robust text generation.
  • Tokenization: A shared BPE tokenizer, trained on the model's full multilingual dataset, supports code-switching and linguistic diversity within a unified lexical space.

This architecture enables rapid inference (RTFx ~749), strong performance on large-scale speech data, and flexibility across multiple languages and tasks.
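
To make the 8× subsampling concrete, here is a small PyTorch sketch of a depthwise-separable convolutional downsampler in the spirit of the FastConformer frontend; the layer sizes, kernel choices, and feature dimensions are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableSubsampler(nn.Module):
    """Illustrative 8x time-subsampling frontend: three stride-2 stages,
    each a depthwise conv followed by a pointwise (1x1) conv."""

    def __init__(self, feat_dim: int = 80, channels: int = 256):
        super().__init__()
        layers = []
        in_ch = 1
        for _ in range(3):  # 2 * 2 * 2 = 8x reduction along time
            layers += [
                nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=2, padding=1,
                          groups=in_ch),                 # depthwise
                nn.Conv2d(in_ch, channels, kernel_size=1),  # pointwise
                nn.ReLU(),
            ]
            in_ch = channels
        self.conv = nn.Sequential(*layers)
        # The frequency axis is also reduced 8x by the stride-2 convs.
        self.out_proj = nn.Linear(channels * (feat_dim // 8), channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) log-mel features
        x = self.conv(feats.unsqueeze(1))                # (B, C, T/8, F/8)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.out_proj(x)                          # (B, T/8, channels)

frames = torch.randn(2, 1600, 80)   # ~16 s of 10 ms frames
print(DepthwiseSeparableSubsampler()(frames).shape)  # torch.Size([2, 200, 256])
```

Three stride-2 stages shorten a 10 ms feature sequence to one encoder frame per 80 ms, which shrinks the self-attention cost in the encoder accordingly.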

2. Training Procedure and Data Balancing

Canary-1B-v2 employs a two-stage pre-training and fine-tuning regime:

  1. Stage 1 Pre-training: The model learns core speech-text representations from approximately 1.7M hours of data, including:
    • 360,000 hours of X→En translation pairs.
    • 285,000 hours of English ASR data.
    • 36,000 hours of non-speech audio with empty-string targets for hallucination reduction.
    • Aggressive data bucketing, optimizer advancements (including techniques akin to OOMptimizer for GPU utilization), and an initial weighting of $\alpha = \beta = 0.5$ for language and corpus balancing.
  2. Stage 2 Pre-training: Further training on the comprehensive data blend for 100,000 additional steps improves cross-task transfer.
  3. Fine-tuning:
    • Focuses on high-quality data: NeMo ASR Set 3.0, curated Granary data, and supplementary En→X translation resources.
    • Dynamic Data Balancing: Corpus sampling within each language follows $p_{c|l} = w_c / \sum_{c' \in l} w_{c'}$, with $w_c = (n(c)/N_l)^{\alpha}$, plus an overall upsampling factor across languages (see the sketch after this section's summary).
    • Cosine scheduling drives the sampling distribution toward uniform weighting as fine-tuning proceeds, allowing adaptation to high-quality, balanced datasets.

This staged protocol enables rapid initial convergence, maximizes coverage, and systematically refines both monolingual and multilingual generation accuracy.
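
The snippet below is a minimal sketch of the per-language corpus balancing and the cosine schedule described above. The direction of the schedule (ramping the exponent toward 0, which makes the per-language corpus distribution uniform) and the corpora, hours, and step counts are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict

def corpus_sampling_probs(hours_by_corpus, language_of, alpha):
    """Per-language corpus sampling: w_c = (n(c) / N_l) ** alpha and
    p(c|l) = w_c / sum of w over corpora of the same language l."""
    lang_totals = defaultdict(float)
    for corpus, hours in hours_by_corpus.items():
        lang_totals[language_of[corpus]] += hours
    weights = {c: (h / lang_totals[language_of[c]]) ** alpha
               for c, h in hours_by_corpus.items()}
    norms = defaultdict(float)
    for c, w in weights.items():
        norms[language_of[c]] += w
    return {c: w / norms[language_of[c]] for c, w in weights.items()}

def cosine_alpha(step, total_steps, alpha_start=0.5, alpha_end=0.0):
    """Cosine ramp of the balancing exponent over fine-tuning. Ending at 0
    makes the per-language corpus distribution uniform -- this endpoint is an
    assumption consistent with 'drives toward uniform weighting'."""
    ramp = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return alpha_end + (alpha_start - alpha_end) * ramp

# Toy example: two German corpora of very different size.
hours = {"de_corpus_A": 9000.0, "de_corpus_B": 1000.0}
langs = {"de_corpus_A": "de", "de_corpus_B": "de"}
for step in (0, 5_000, 10_000):
    a = cosine_alpha(step, total_steps=10_000)
    probs = corpus_sampling_probs(hours, langs, a)
    print(step, round(a, 2), {c: round(p, 3) for c, p in probs.items()})
```

Early in fine-tuning the larger corpus is still sampled more often (0.75 vs. 0.25 at α = 0.5); by the end of the schedule both corpora are drawn with equal probability.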

3. Performance and Benchmarking

Canary-1B-v2 delivers state-of-the-art results in large-scale multilingual ASR and AST:

| Model | Task | Word Error Rate (WER) | Throughput (RTFx) | Languages |
|---|---|---|---|---|
| Canary-1B-v2 | ASR | 8.1% (avg, multilingual) | ~749 | 25 (primarily European) |
| Whisper-large-v3 | ASR | 9.9% | ~75–100 | 99+ |
| Canary-1B-v2 | AST | Competitive COMET vs SeamlessM4T | See paper | 25 pairs |
  • On English ASR benchmarks (e.g., the Hugging Face Open ASR Leaderboard, FLEURS, CoVoST, MLS), Canary-1B-v2 surpasses Whisper-large-v3 in accuracy while running approximately 10× faster at inference.
  • For AST, its COMET scores exceed those of lighter models (e.g., SeamlessM4T-medium) and come close to those of significantly larger models (e.g., SeamlessM4T-v2-large), despite the lower parameter count.
  • Evaluations include per-language tables; the model sustains low WER and high semantic fidelity across all target languages.

This indicates that architectural efficiency and strategic training can deliver industry-leading accuracy and throughput even without extreme model scaling.
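
For reference, the two headline metrics are straightforward to compute: WER via the jiwer package, and RTFx as total audio duration divided by total wall-clock inference time (the inverse real-time factor). The numbers in this example are made up for illustration.

```python
import jiwer  # pip install jiwer

def rtfx(total_audio_seconds: float, total_wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of
    wall-clock compute. RTFx ~749 means roughly 12.5 minutes of audio per second."""
    return total_audio_seconds / total_wall_seconds

refs = ["the quick brown fox", "hello world"]
hyps = ["the quick brown fox", "hello word"]
print(f"WER:  {jiwer.wer(refs, hyps):.3f}")   # 1 error over 6 reference words -> 0.167
print(f"RTFx: {rtfx(3600.0, 4.8):.0f}")       # e.g. 1 h of audio decoded in 4.8 s -> 750
```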

4. Timestamping and Alignment

Reliable token-, word-, and segment-level timestamps are central for downstream tasks (subtitling, data curation, indexing):

  • Canary-1B-v2 integrates the NeMo Forced Aligner (NFA) and an auxiliary Parakeet CTC model (600M parameters) for robust forced alignment.
  • Unlike attention-matrix Dynamic Time Warping, which can produce non-monotonic and fuzzy alignments, CTC-based forced alignment uses Viterbi decoding to map transcription tokens onto time frames, yielding clear, "hard" segmentation.
  • This is especially pertinent for AST, where segment-level (rather than word-level) timestamps are more robust due to differences in source/target word order.

This forced alignment strategy enables consistent, accurate time annotation, essential for high-value ASR/AST integration in production settings.
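
The sketch below is a minimal NumPy implementation of CTC-style Viterbi forced alignment, the general technique described above; it is not the NeMo Forced Aligner itself, and the 80 ms frame duration in the demo is an assumption (10 ms features with 8× subsampling).

```python
import numpy as np

def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi alignment of a known token sequence to CTC frame log-probs.
    log_probs: (T, V) array; tokens: list of target ids (non-blank).
    Returns one (start_frame, end_frame) span per token."""
    T = log_probs.shape[0]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]                     # blank-interleaved label sequence
    S = len(ext)
    NEG = -1e30
    dp = np.full((T, S), NEG)
    bp = np.zeros((T, S), dtype=np.int64)
    dp[0, 0] = log_probs[0, ext[0]]             # start in the leading blank ...
    dp[0, 1] = log_probs[0, ext[1]]             # ... or in the first token
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]                              # stay
            if s >= 1:
                cands.append(dp[t - 1, s - 1])                  # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])                  # skip a blank
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            bp[t, s] = s - best                                 # predecessor state
    # Finish in the trailing blank or the last token, whichever scores higher.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    states = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t, s]
        states.append(s)
    states.reverse()
    spans = {}
    for t, s in enumerate(states):
        if ext[s] != blank:
            idx = (s - 1) // 2                  # which target token this state is
            start, _ = spans.get(idx, (t, t))
            spans[idx] = (start, t)
    return [spans[i] for i in range(len(tokens))]

# Toy demo: align a 3-token transcript to 12 frames over a 5-symbol vocabulary.
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(5), size=12))
frame_sec = 0.08                                # assumed: 10 ms features x 8 subsampling
for tok, (s, e) in zip([2, 4, 1], ctc_forced_align(log_probs, [2, 4, 1])):
    print(f"token {tok}: {s * frame_sec:.2f}s - {(e + 1) * frame_sec:.2f}s")
```

Because the dynamic program can only skip blank states, the recovered path is guaranteed to be monotonic and to visit every token exactly once, which is the "hard" segmentation property the section contrasts with attention-based DTW.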

5. Comparison with Parakeet-TDT-0.6B-v3 and Other Systems

Parakeet-TDT-0.6B-v3 is introduced as a highly parameter-efficient ASR-only model:

| Model | Parameters | Scope | WER (avg, multilingual) | Throughput (RTFx) |
|---|---|---|---|---|
| Canary-1B-v2 | ~1B | ASR, AST | 8.1% | ~749 |
| Parakeet-TDT-0.6B-v3 | 600M | ASR only | Slightly higher than Canary-1B-v2 | >3300 |
| LLM-based ASR systems | 7B+ | Flexible | Higher | <62 |
  • Parakeet-TDT-0.6B-v3 is trained solely on 660,000 hours of ASR data and supports the same 25 languages with a lower computational footprint and very competitive performance.
  • It achieves an RTFx exceeding 3300 (over 54× faster than some LLM-based baselines) with only modest degradation in WER compared to Canary-1B-v2.
  • This highlights that with focused data curation and dynamic balancing, high-quality ASR does not require scaling to extreme parameter counts.

6. Techniques for Hallucination Reduction

To address common failure modes in ASR and AST such as hallucination:

  • Non-speech audio (36,000 hours of empty-string targets) is included in pre-training to expose the model to negative examples, thereby reducing spurious transcriptions in both ASR and AST tasks.
  • The presence of weakly supervised, large-scale audio-text data in training encourages greater generalization and minimizes overfitting to spurious patterns.

This suggests effective suppression of hallucinations is possible through large-scale negative sampling and careful curriculum selection rather than extensive architectural overhauls.
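
As an illustration of the empty-target idea, the snippet below writes JSON-lines manifest rows in which non-speech clips carry an empty transcript, following the audio_filepath/duration/text convention commonly used for NeMo training manifests; the file paths, durations, and transcripts are placeholders.

```python
import json

def manifest_entry(audio_path: str, duration_s: float, text: str) -> str:
    """One JSON-lines manifest row; an empty `text` marks a negative
    (non-speech) example that should yield no transcription."""
    return json.dumps({"audio_filepath": audio_path,
                       "duration": duration_s,
                       "text": text})

rows = [
    manifest_entry("clips/speech_0001.wav", 6.2, "guten morgen zusammen"),
    manifest_entry("clips/street_noise_0001.wav", 5.0, ""),   # non-speech, empty target
    manifest_entry("clips/music_0002.wav", 7.4, ""),          # non-speech, empty target
]
with open("train_manifest.json", "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")
```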

7. Significance and Application Scope

  • Canary-1B-v2 demonstrates that architectural efficiency, large-scale multilingual training, dynamic balancing, and robust forced alignment collectively enable state-of-the-art performance while remaining deployable in resource-constrained scenarios.
  • The approach is adaptable to both broad-coverage multilingual ASR and speech translation, and with minor modifications, can be extended to additional modalities or languages.

This suggests a paradigm where models with moderate parameter counts, when paired with sophisticated training protocols and modular alignment tools, can match or surpass the output quality of much larger systems across practical, real-world tasks.


For further technical details and experimental results, see "Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST" (Sekoyan et al., 17 Sep 2025).
