Analysis of "Slamming: Training a Speech LLM on One GPU in a Day"
This paper introduces "Slam," a recipe for training high-quality Speech Language Models (SLMs) on a limited computational budget, specifically a single academic GPU within a 24-hour period. The authors systematically explore a range of training strategies and techniques aimed at optimizing performance within such constraints, and empirically demonstrate that the recipe scales favorably when more compute is available.
Key Contributions and Methodology
- Training Strategy: The authors investigate the influence of various training components, including model initialization and architecture, synthetic training data, preference optimization, and hyperparameter tuning. They derive a comprehensive training recipe to maximize model performance while adhering to a fixed computational budget.
- Empirical Insights: The paper emphasizes that utilizing synthetic data and incorporating diverse efficiency optimizations can significantly enhance SLM performance. The proposed "Slamming" process warm-starts the speech model from text-based pre-trained models, improving convergence and final performance. Key components include:
- Utilizing TWIST initialization with the Qwen2.5 architecture for improved performance (see the initialization sketch after this list).
- Employing synthetic datasets generated via Text-to-Speech (TTS) methodologies.
- Preference optimization on synthetic data to improve alignment with semantic tasks (see the preference-optimization sketch after this list).
- Performance Evaluation: The paper evaluates SLMs on several established benchmarks, including sBLIMP, Spoken StoryCloze (sSC), Topic StoryCloze (tSC), and Generative Perplexity (GenPPL); a sketch of the likelihood-based scoring these benchmarks rely on follows this list. The authors benchmark their approach against existing state-of-the-art models, achieving comparable or superior outcomes with significantly reduced computational resources.
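As a rough illustration of the initialization idea mentioned above, the sketch below shows how a TWIST-style speech LM could be warm-started from a text-pretrained backbone using Hugging Face Transformers. The vocabulary size, special-token count, and exact checkpoint name are assumptions for illustration, not the authors' released configuration.

```python
# Illustrative TWIST-style warm start (not the authors' code): reuse a
# text-pretrained transformer body and swap its vocabulary for discrete
# speech units before continued next-token training on speech tokens.
from transformers import AutoModelForCausalLM

N_SPEECH_UNITS = 500   # assumed size of the discrete speech-unit vocabulary
N_SPECIAL = 2          # assumed special tokens, e.g. padding and end-of-sequence

# Load the text-pretrained backbone (the recipe builds on Qwen2.5-0.5B).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Replace the large text vocabulary with the small speech-unit vocabulary.
# The transformer layers keep their text-pretrained weights; only the input
# embeddings and LM head are resized and re-initialized for speech tokens.
model.resize_token_embeddings(N_SPEECH_UNITS + N_SPECIAL)

# From here, training proceeds as ordinary next-token prediction over
# sequences of speech-unit IDs produced by a speech tokenizer.
```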
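The preference-optimization step is sketched below under the assumption that it is DPO-style training on pairs of preferred and rejected spoken continuations; the function signature and beta value are illustrative choices, not taken from the paper.

```python
# Minimal DPO-style loss over sequence log-probabilities (an assumption about
# the form of the preference-optimization step, not the authors' implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen continuation over the rejected one,
    relative to a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```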
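Several of the reported metrics (sBLIMP, sSC, tSC) are likelihood-comparison benchmarks: the model is credited when it assigns higher probability to the correct spoken continuation than to a distractor. A minimal sketch of that scoring loop is given below, with placeholder token IDs and no length normalization (the actual benchmark implementations may differ).

```python
# Illustrative likelihood-comparison scoring (placeholder data, not the
# official evaluation code for these benchmarks).
import torch

@torch.no_grad()
def sequence_logprob(model, token_ids):
    """Total log-probability the model assigns to a sequence of speech-token IDs."""
    ids = torch.tensor([token_ids])
    logits = model(ids).logits[:, :-1]           # predictions for positions 1..T-1
    targets = ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

def pairwise_accuracy(model, examples):
    """examples: iterable of (prefix, correct_ending, distractor_ending) ID lists."""
    hits = 0
    for prefix, good, bad in examples:
        hits += sequence_logprob(model, prefix + good) > sequence_logprob(model, prefix + bad)
    return hits / len(examples)
```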
Notable Results
- The Slam methodology demonstrates that it is feasible to train effective SLMs on a highly constrained budget. Using the Qwen2.5-0.5B model initialized via TWIST, the authors report improvements across the board on established SLM evaluation benchmarks.
- The inclusion of synthetic training data, such as sTinyStories, was found to significantly boost both modeling and generative performance, showcasing the utility of synthetic datasets in low-compute setups.
- Performance comparisons show that Slam not only rivals but can exceed the compute-optimal performance projected by previously published SLM scaling laws.
Implications and Future Directions
The results presented in this paper have significant implications for democratizing SLM training and making it more accessible, allowing smaller academic labs to participate in cutting-edge speech model research. The open-source release of code and data facilitates further exploration and validation in broader contexts, potentially spurring innovation in the SLM field.
The findings underscore the importance of careful training-recipe choices and the growing relevance of synthetic data for training SLMs efficiently. Developers and researchers can leverage these insights to optimize training pipelines in constrained environments, promoting more sustainable AI practices.
Conclusion
The paper provides a valuable contribution to SLM research, challenging traditional notions of resource-intensive model training. By combining efficiency-minded training with techniques such as preference optimization and synthetic data, the paper carves out a pragmatic path for the future of speech language modeling.
In sum, the "Slamming" approach delineated in this work not only offers a way to reduce the cost of and barriers to SLM training, but also paves the way for scalable and feasible exploration in the field, encouraging new norms in resource allocation for AI research.