Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
109 tokens/sec
GPT-4o
12 tokens/sec
Gemini 2.5 Pro Pro
35 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
5 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
2000 character limit reached

Long-Form Speech Generation with Spoken Language Models (2412.18603v1)

Published 24 Dec 2024 in cs.CL, cs.SD, and eess.AS

Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken LLMs struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech LLM to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Github Logo Streamline Icon: https://streamlinehq.com
Youtube Logo Streamline Icon: https://streamlinehq.com