Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration (2505.04457v2)

Published 7 May 2025 in cs.SD, cs.CL, and eess.AS

Abstract: Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like LLMs. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.

Summary

Miipher-2: A Universal Speech Restoration Framework for Large-Scale Data

Miipher-2 is a universal speech restoration (SR) model designed to address the challenges of restoring speech data at very large scale, particularly for cleaning the training data of generative models such as LLMs. The motivation is the need for reliable, high-quality audio when training such models: common collection methods such as web scraping inevitably introduce noisy, degraded samples.

Innovation Through Self-Supervised Learning

Miipher-2 uses a frozen, pre-trained Universal Speech Model (USM) as a robust, conditioning-free feature extractor. Because no explicit text or speaker-ID conditioning is required, the model can generalize across the more than 300 languages covered by USM, including languages with little high-quality training data. The restoration task then reduces to predicting clean USM features from noisy input, which is what allows the model to handle languages unseen during SR training.
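
The overall pipeline can be pictured as three stages: a frozen SSL encoder turns the noisy waveform into features, a trainable cleaner maps noisy features to clean ones, and a vocoder synthesizes the restored waveform. The following is a minimal PyTorch sketch of that data flow only; the class names, dimensions, and internals are illustrative placeholders, not the authors' USM, feature-cleaner, or WaveFit implementations.

    import torch
    import torch.nn as nn

    class FrozenSSLEncoder(nn.Module):
        """Stand-in for the frozen, pre-trained USM feature extractor (weights never updated)."""
        def __init__(self, frame_dim=160, feat_dim=1024):
            super().__init__()
            self.proj = nn.Linear(frame_dim, feat_dim)  # placeholder for the real encoder stack
            for p in self.parameters():
                p.requires_grad = False

        def forward(self, frames):                      # frames: (batch, time, frame_dim)
            return self.proj(frames)                    # noisy features: (batch, time, feat_dim)

    class FeatureCleaner(nn.Module):
        """Trainable module that predicts clean USM features from noisy ones."""
        def __init__(self, feat_dim=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, feat_dim), nn.GELU(),
                nn.Linear(feat_dim, feat_dim),
            )

        def forward(self, noisy_feats):
            return noisy_feats + self.net(noisy_feats)  # residual prediction of clean features

    class Vocoder(nn.Module):
        """Stand-in for the WaveFit vocoder: maps cleaned features back to waveform samples."""
        def __init__(self, feat_dim=1024, hop=320):
            super().__init__()
            self.to_audio = nn.Linear(feat_dim, hop)

        def forward(self, feats):                       # (batch, time, feat_dim)
            return self.to_audio(feats).flatten(1)      # (batch, time * hop)

    def restore(frames, encoder, cleaner, vocoder):
        with torch.no_grad():
            noisy_feats = encoder(frames)               # no text or speaker-ID conditioning
        return vocoder(cleaner(noisy_feats))

    # Example: restore one 2-second batch of framed audio (shapes are illustrative).
    frames = torch.randn(1, 200, 160)
    waveform = restore(frames, FrozenSSLEncoder(), FeatureCleaner(), Vocoder())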

Architectural Efficiency and Optimization

A key aspect of Miipher-2 is its computational efficiency, which is essential at this data scale. Parallel adapters (PAs) replace a conventional, heavier feature-cleaner architecture, reducing the memory footprint and speeding up processing. In addition, the WaveFit neural vocoder, with memory-usage optimizations, provides the remaining efficiency needed to process a million hours of speech in roughly three days on consumer-grade accelerators.
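
A parallel adapter adds a small trainable branch alongside a frozen encoder layer and sums the two outputs, so only the adapter parameters are trained while the USM weights stay fixed. The sketch below shows the general parallel-adapter pattern under those assumptions; the sizes and bottleneck design are illustrative, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class ParallelAdapter(nn.Module):
        """Small bottleneck branch that runs in parallel with a frozen layer."""
        def __init__(self, dim=1024, bottleneck=256):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, x):
            return self.up(self.act(self.down(x)))

    class AdaptedLayer(nn.Module):
        """Frozen encoder layer plus a trainable parallel adapter; outputs are summed."""
        def __init__(self, frozen_layer, dim=1024):
            super().__init__()
            self.frozen_layer = frozen_layer
            for p in self.frozen_layer.parameters():
                p.requires_grad = False                 # only the adapter receives gradients
            self.adapter = ParallelAdapter(dim)

        def forward(self, x):
            return self.frozen_layer(x) + self.adapter(x)   # parallel, not sequential, insertion

    layer = AdaptedLayer(nn.Linear(1024, 1024))         # placeholder for a real USM layer
    out = layer(torch.randn(1, 200, 1024))

The reported real-time factor is also easy to sanity-check: at an RTF of 0.0078, a million hours of audio requires about 1,000,000 × 0.0078 ≈ 7,800 accelerator-hours, which is roughly 78 hours, or a little over three days, when spread across 100 accelerators.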

Competitive Performance Metrics

The experimental evaluation shows Miipher-2 performing on par with or better than existing SR models on word error rate (WER), speaker similarity, and both objective and subjective quality (MOS) scores. Importantly, these results hold across all tested languages, including languages not represented in the SR training data, demonstrating the model's broad applicability.
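
Of these metrics, WER is the most mechanical to reproduce: it is the word-level edit distance between an ASR transcript of the restored audio and the reference transcript, normalized by the reference length. A minimal reference implementation (not tied to the paper's evaluation code):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + insertions + deletions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard Levenshtein distance over words via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution / match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167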

Practical and Theoretical Implications

The implications of Miipher-2 are significant for both practical applications and theoretical understanding in AI. Practically, the framework facilitates the cleaning and enhancement of massive speech datasets, pivotal for developing models that support text-to-speech synthesis and other audio-sensitive applications. Theoretically, it presents a substantial contribution to the research on self-supervised learning models, emphasizing their ability to generalize without explicit conditioning information.

Looking Forward: Future Developments in AI

Miipher-2 sets a foundation for further exploration into speech restoration practices, particularly those requiring efficient processing of large datasets. Future research could investigate extending the methodology for broader audio applications beyond speech, enriching multi-modal generative model training datasets. Moreover, the methodology could inspire innovations in low-resource language support, pushing the boundaries of AI inclusivity.

In summary, Miipher-2 proves to be an effective, efficient model capable of tackling the inherent challenges of large-scale data cleaning, offering promising directions for evolving speech processing technologies and AI applications on a global scale.
