Slamming: Training a Speech Language Model on One GPU in a Day (2502.15814v2)

Published 19 Feb 2025 in cs.LG, cs.AI, cs.CL, cs.SD, and eess.AS

Abstract: We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

Authors (3)
  1. Gallil Maimon (8 papers)
  2. Avishai Elmakies (2 papers)
  3. Yossi Adi (96 papers)

Summary

Analysis of "Slamming: Training a Speech LLM on One GPU in a Day"

This paper introduces "Slam," a training recipe for producing high-quality Speech Language Models (SLMs) on a limited computational budget, specifically a single academic GPU within 24 hours. The authors systematically explore a range of training strategies and techniques aimed at maximizing performance under this constraint, and they empirically demonstrate that the recipe also scales to larger compute budgets.

Key Contributions and Methodology

  1. Training Strategy: The authors investigate the influence of various training components, including model initialization and architecture, synthetic training data, preference optimization, and hyperparameter tuning. They derive a comprehensive training recipe to maximize model performance while adhering to a fixed computational budget.
  2. Empirical Insights: The paper shows that synthetic data and a range of efficiency optimizations can significantly enhance SLM performance. The proposed Slam recipe explores model-initialization variants that leverage text-pretrained models to speed up convergence and improve final quality. Key components include:
    • Utilizing TWIST initialization with the text-pretrained Qwen2.5 architecture for improved performance (a minimal initialization sketch follows this list).
    • Employing synthetic datasets generated via Text-to-Speech (TTS) methodologies.
    • Preference optimization with synthetic data to improve alignment on semantic tasks (an illustrative preference-optimization sketch also follows this list).
  3. Performance Evaluation: The paper assesses SLM performance with several established metrics, including sBLIMP, Spoken StoryCloze (sSC), Topic StoryCloze (tSC), and Generative Perplexity (GenPPL). The authors benchmark their approach against existing state-of-the-art models, achieving comparable or superior outcomes with significantly reduced computational resources.
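
The core of the initialization step above is to warm-start a speech-unit language model from a text-pretrained transformer, keeping the transformer body's weights and replacing only the vocabulary-facing layers. A minimal sketch of that idea, assuming a Hugging Face transformers setup and an illustrative 500-unit speech vocabulary (the exact unit inventory and special tokens are assumptions, not the paper's configuration), could look like this:

```python
# Minimal sketch of TWIST-style warm initialization (not the authors' code):
# a causal LM over discrete speech units is warm-started from the
# text-pretrained Qwen2.5-0.5B checkpoint.
import torch.nn as nn
from transformers import AutoModelForCausalLM

N_SPEECH_UNITS = 500   # e.g. clustered speech units; inventory size is assumed
SPECIAL_TOKENS = 4     # pad / BOS / EOS / etc. (assumed)

# Warm start: the transformer body keeps its text-pretrained weights.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Swap the large text vocabulary for the small speech-unit vocabulary;
# only the embedding table and LM head change size.
model.resize_token_embeddings(N_SPEECH_UNITS + SPECIAL_TOKENS)

# The old text embeddings carry no meaning for audio units, so re-initialize
# the embedding table (and, if tied, the LM head) from scratch.
nn.init.normal_(model.get_input_embeddings().weight, std=0.02)

# From here the model is trained as an ordinary causal LM, but on sequences
# of discrete speech tokens produced by a speech tokenizer.
```

Because the transformer body retains its text-pretrained weights, training on speech tokens starts from a much better point than random initialization, which is the benefit the paper attributes to this warm start.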

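For the preference-optimization component, a common instantiation is a DPO-style objective over pairs of preferred and rejected speech-token continuations built from synthetic data. The sketch below illustrates such a loss in generic PyTorch terms; the helper names, the lack of prompt masking, and the exact choice of objective are illustrative assumptions rather than the paper's training code.

```python
# Illustrative DPO-style preference loss over speech-token continuations.
# `policy` is the SLM being trained, `reference` a frozen copy of it;
# `chosen` / `rejected` are (batch, seq_len) LongTensors of speech-unit ids.
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokens):
    """Sum of next-token log-probabilities for `tokens` under `model`."""
    logits = model(tokens[:, :-1]).logits
    logp = F.log_softmax(logits, dim=-1)
    target = tokens[:, 1:].unsqueeze(-1)
    return logp.gather(-1, target).squeeze(-1).sum(dim=-1)

def dpo_loss(policy, reference, chosen, rejected, beta=0.1):
    """Push `policy` to prefer `chosen` over `rejected` continuations,
    measured relative to the frozen `reference` model."""
    pi_c = sequence_logprob(policy, chosen)
    pi_r = sequence_logprob(policy, rejected)
    with torch.no_grad():
        ref_c = sequence_logprob(reference, chosen)
        ref_r = sequence_logprob(reference, rejected)
    margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
    return -F.logsigmoid(margin).mean()
```

In practice the prompt portion of each sequence would be masked out of the log-probability sums; the sketch scores whole sequences for brevity.
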
Notable Results

  • The Slam recipe demonstrates that effective SLMs can be trained on a highly constrained budget. Using the Qwen2.5-0.5B model initialized through TWIST, the authors report improvements across the board on established SLM benchmarks (an illustrative scoring sketch follows this list).
  • The inclusion of synthetic training data, such as sTinyStories, was found to significantly boost both modeling and generative performance, showcasing the utility of synthetic datasets in low-compute setups.
  • Performance comparisons show that Slam not only rivals but can exceed the compute-optimal performance predicted by existing SLM scaling laws.
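
Most of these benchmarks are likelihood-comparison tasks: the model is credited when it assigns a higher (length-normalized) log-likelihood to the correct spoken sequence than to a minimally different negative, as in sBLIMP or the StoryCloze variants. A compact scoring loop in that spirit, assuming a Hugging Face causal LM over speech-unit ids and a hypothetical `pairs` iterator (not the paper's evaluation harness), might look like:

```python
# Illustrative pairwise-likelihood scoring for benchmarks such as sBLIMP or
# Spoken/Topic StoryCloze. `pairs` yields (positive, negative) LongTensors of
# speech-unit ids, each of shape (1, seq_len); its source is an assumption.
import torch

@torch.no_grad()
def pairwise_accuracy(model, pairs):
    correct, total = 0, 0
    for pos, neg in pairs:
        # With `labels`, a Hugging Face causal LM returns the mean
        # cross-entropy, i.e. the negative mean log-likelihood per token.
        lp_pos = -model(pos, labels=pos).loss
        lp_neg = -model(neg, labels=neg).loss
        correct += int(lp_pos > lp_neg)
        total += 1
    return correct / total
```

Generative Perplexity, by contrast, is typically measured on the model's own sampled continuations rather than on fixed pairs.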

Implications and Future Directions

The results presented in this paper have significant implications for democratizing SLM training, allowing smaller academic labs to participate in cutting-edge speech model research. The open-source release of code, data, and models facilitates further exploration and validation in broader contexts, potentially spurring innovation in the SLM field.

The findings underscore the importance of careful choices in initialization, architecture, data, and hyperparameters, as well as the growing relevance of synthetic data for training SLMs efficiently. Developers and researchers can leverage these insights to optimize training pipelines under constrained budgets, promoting more sustainable AI practices.

Conclusion

The paper makes a valuable contribution to SLM research, challenging the notion that model training must be resource-intensive. By pairing training-efficiency measures with techniques such as preference optimization and synthetic data, it charts a pragmatic path for the future of speech language modeling.

In sum, the Slam recipe delineated in this work not only lowers the cost of and barriers to SLM training but also paves the way for scalable, feasible exploration in the field, encouraging new norms in resource allocation for AI research.