GigaAM: Efficient Self-Supervised Learner for Speech Recognition (2506.01192v1)

Published 1 Jun 2025 in eess.AS and cs.SD

Abstract: Self-Supervised Learning (SSL) has demonstrated strong performance in speech processing, particularly in automatic speech recognition. In this paper, we explore an SSL pretraining framework that leverages masked language modeling with targets derived from a speech recognition model. We also present chunkwise attention with dynamic chunk size sampling during pretraining to enable both full-context and streaming fine-tuning. Our experiments examine scaling with respect to model size and the amount of data. Using our method, we train the GigaAM family of models, including a state-of-the-art model for Russian speech recognition that outperforms Whisper-large-v3 by 50%. We have released our foundation and ASR models, along with the inference code, under the MIT license as open-source resources to the research community. Available at https://github.com/salute-developers/gigaam.

Authors (5)
  1. Aleksandr Kutsakov (1 paper)
  2. Alexandr Maximenko (1 paper)
  3. Georgii Gospodinov (1 paper)
  4. Pavel Bogomolov (2 papers)
  5. Fyodor Minkin (2 papers)

Summary

The paper "GigaAM: Efficient Self-Supervised Learner for Speech Recognition" (Kutsakov et al., 1 Jun 2025 ) introduces a self-supervised learning (SSL) framework for speech recognition that aims to improve efficiency and performance, particularly in low-resource settings and for specific languages like Russian. The core idea is to leverage masked LLMing using target variables derived from a supervised Automatic Speech Recognition (ASR) model, rather than relying solely on low-level features or intermediate layers as some prior methods do.

The authors propose a pretraining method called HuBERT-CTC, which builds upon the HuBERT framework. Unlike standard HuBERT or BEST-RQ [bestrq], HuBERT-CTC generates discrete target tokens for the masked prediction task by applying K-means clustering to the hidden states of the last layer of a pre-trained, fine-tuned CTC-based ASR model. This is motivated by the observation that the final layers of ASR models tend to capture semantically richer information relevant to the ASR task itself. The empirical results presented in the paper (Fig. 1) show that probing the layers of encoders pretrained with HuBERT-CTC yields monotonically improving ASR performance up to the final layer, suggesting that the learned representations are highly relevant for the downstream ASR task, which is especially beneficial in low-resource fine-tuning scenarios.
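A minimal sketch of this target-generation step is shown below, assuming a frame-synchronous teacher encoder; `teacher_encoder`, `unlabeled_feature_loader`, and the cluster count are hypothetical placeholders, not the authors' code.

```python
import torch
from sklearn.cluster import MiniBatchKMeans

def last_layer_states(teacher_encoder, features, lengths):
    """Run the teacher CTC encoder and gather its final-layer states for valid frames."""
    with torch.no_grad():
        hidden, out_lengths = teacher_encoder(features, lengths)  # (B, T, D), (B,)
    frames = [hidden[b, :out_lengths[b]] for b in range(hidden.size(0))]
    return torch.cat(frames, dim=0)  # (num_frames, D)

# 1) Fit K-means on teacher states drawn from the unlabeled pretraining corpus.
kmeans = MiniBatchKMeans(n_clusters=1024, batch_size=4096)
for features, lengths in unlabeled_feature_loader:
    states = last_layer_states(teacher_encoder, features, lengths)
    kmeans.partial_fit(states.cpu().numpy())

# 2) Assign each frame a discrete cluster ID; these IDs serve as the prediction
#    targets for masked positions during SSL pretraining.
def frame_targets(teacher_encoder, features, lengths):
    states = last_layer_states(teacher_encoder, features, lengths)
    return torch.from_numpy(kmeans.predict(states.cpu().numpy()))
```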

A practical challenge in ASR, especially for long-form audio or streaming applications, is handling variable input lengths and providing timely output. To address this, the paper integrates a chunkwise attention mechanism with dynamic chunk size sampling during pretraining. This involves partitioning the input audio into segments of varying lengths (e.g., 1s, 2s, 4s, 8s) and computing attention within these local chunks. By sampling different chunk sizes during pretraining, the model learns to adapt to different context lengths, enabling a single pretrained model to be fine-tuned for both full-context and streaming ASR setups without requiring separate pretraining runs. For streaming inference, the model utilizes chunkwise causal convolutions [chunk_causal] within the Conformer architecture to restrict the receptive field and allow processing audio in fixed-size chunks with limited future context.
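To illustrate the idea (this is not the paper's implementation), the sketch below samples a chunk length per training example and builds a block-diagonal self-attention mask; the encoder frame rate is an assumed value.

```python
import random
import torch

CHUNK_SIZES_SEC = [1.0, 2.0, 4.0, 8.0]  # chunk lengths mentioned in the paper
FRAMES_PER_SEC = 25                     # assumed encoder frame rate after subsampling

def sample_chunk_frames() -> int:
    """Pick a chunk length (in encoder frames) for the current training sample."""
    return int(random.choice(CHUNK_SIZES_SEC) * FRAMES_PER_SEC)

def chunkwise_attention_mask(num_frames: int, chunk_frames: int) -> torch.Tensor:
    """Boolean (T, T) mask: frame i may attend to frame j only within the same chunk."""
    chunk_ids = torch.arange(num_frames) // chunk_frames
    return chunk_ids.unsqueeze(0) == chunk_ids.unsqueeze(1)

# Example: build the mask for a 10 s utterance (250 frames at the assumed rate)
# with a randomly sampled chunk size; attention is confined to diagonal blocks.
mask = chunkwise_attention_mask(num_frames=250, chunk_frames=sample_chunk_frames())
```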

The paper highlights several key practical aspects and contributions:

  1. Semantically Enriched Targets: Using targets from the final layer of a fine-tuned ASR model via K-means clustering provides more ASR-specific representations compared to using low-level features or intermediate layer outputs. This is shown to yield better performance, especially when fine-tuning on limited labeled data.
  2. Effective Data Preprocessing: A simple Voice Activity Detection (VAD) filtering step is applied to remove segments with excessive silence from the unlabeled pretraining data. This inexpensive preprocessing step is found to improve model performance significantly (5-10% average improvement), while the more complex data filtering methods that were explored had negligible impact (an illustrative VAD sketch follows this list).
  3. Unified Pretraining for Full-Context and Streaming: The dynamic chunk size sampling strategy during pretraining allows the model to learn representations suitable for different inference modes. This avoids the need for separate pretraining for streaming-specific models, simplifying the development workflow and potentially reducing computational costs.
  4. State-of-the-Art Performance on Russian: The GigaAM models, particularly the CTC and RNN-T variants fine-tuned on Russian data, demonstrate strong performance on multiple Russian ASR benchmarks, significantly outperforming other large models like Whisper-large-v3 [whisper] on the evaluated datasets (Table 1). This underscores the effectiveness of the domain-specific pretraining approach.
  5. Scalability Analysis: The authors conducted systematic experiments analyzing the impact of pretraining data size, fine-tuning data size, model capacity, and pretraining steps on performance. Key findings include:
    • Performance stabilizes after a certain threshold of unsupervised pretraining data (~6k hours in their experiments), indicating efficient knowledge transfer from the teacher model even with a relatively small unsupervised dataset.
    • Smaller GigaAM models (e.g., 100M parameters) can outperform larger teacher models (e.g., 240M parameters), suggesting the potential for developing more efficient models.
    • Pretraining acts as a form of regularization, effectively reducing the need for explicit regularization techniques like CR-CTC [crctc], especially for smaller models.
  6. Open-Source Release: The pretrained foundation models, fine-tuned ASR models for Russian, and inference code are released under the MIT license, facilitating reproducibility and further research and application development by the community.
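On point 2, the paper does not detail the VAD method, so the following is only an illustrative energy-threshold filter for dropping silence-dominated segments before pretraining; the window, hop, and threshold values are assumptions.

```python
import numpy as np

def speech_ratio(wave: np.ndarray, sr: int, win_s: float = 0.025,
                 hop_s: float = 0.010, floor_db: float = -40.0) -> float:
    """Fraction of frames whose log-energy is within `floor_db` of the peak frame."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = np.lib.stride_tricks.sliding_window_view(wave, win)[::hop]
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    return float(np.mean(energy_db > energy_db.max() + floor_db))

def keep_segment(wave: np.ndarray, sr: int, min_speech_ratio: float = 0.5) -> bool:
    """Drop segments dominated by silence before SSL pretraining."""
    return speech_ratio(wave, sr) >= min_speech_ratio
```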

Implementation Considerations:

  • Teacher Model: Implementing HuBERT-CTC requires a pre-trained, fine-tuned ASR model (preferably CTC-based) to generate the clustering targets. The quality and domain relevance of this teacher model are crucial.
  • Clustering: K-means clustering needs to be applied to the hidden states of the teacher model's last layer across the pretraining dataset to generate the discrete targets. This involves extracting features, running the clustering algorithm, and assigning tokens.
  • Data Pipeline: The pretraining data pipeline needs to handle audio loading, feature extraction, VAD filtering, masking, and target generation based on the teacher model and K-means centroids.
  • Dynamic Chunking: Implementing dynamic chunking requires modifying the data loader to sample different chunk sizes for different training samples within a batch. The model architecture (Conformer) needs to support chunkwise attention and optionally chunkwise causal convolutions for streaming.
  • Computational Resources: While the paper explores efficiency gains and data scaling properties, training large SSL models like GigaAM on 100k hours of data still requires significant computational resources (GPUs). The scaling analysis helps in deciding the trade-offs between model size, data size, and compute budget for specific applications.
  • Low-Resource Adaptation: The findings suggest that HuBERT-CTC is particularly effective in low-resource scenarios, making it a valuable approach for developing ASR systems for languages with limited labeled data, provided sufficient unlabeled data and a suitable teacher model are available.
  • Inference Modes: The dynamic chunking allows for flexible deployment, supporting full-context processing for tasks like transcription of recorded audio and streaming processing for real-time applications. Choosing the appropriate chunk size at inference time depends on the required latency-accuracy trade-off: the paper's results indicate that smaller chunk sizes lead to higher WER but enable lower-latency streaming. A minimal chunked-inference loop is sketched below.
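The following chunked-inference loop illustrates the streaming mode; the `forward_chunk` interface and its state handling are hypothetical and do not reflect the released GigaAM API.

```python
import torch

def stream_transcribe(model, audio: torch.Tensor, sr: int, chunk_s: float = 2.0):
    """Feed a 1-D waveform to a chunk-capable encoder piece by piece.

    `model` is assumed to expose a stateful `forward_chunk(chunk, state)` that
    returns (tokens, state); smaller `chunk_s` lowers latency at some WER cost.
    """
    chunk = int(chunk_s * sr)
    state, hypothesis = None, []
    for start in range(0, audio.numel(), chunk):
        piece = audio[start:start + chunk].unsqueeze(0)  # add batch dimension
        tokens, state = model.forward_chunk(piece, state)
        hypothesis.extend(tokens)
    return hypothesis
```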