Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing (2406.06582v2)
Abstract: Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text-to-speech, and speech-to-speech translation. Moreover, LLMs pretrained on vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal LLM (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multimodal models, including the loss function, weight initialization, mixed training supervision, and the codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
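The abstract mentions a speech codebook derived from Whisper activations that yields discrete tokens shared with a decoder-only LM. The paper's exact procedure is not given here, so the following is only a minimal sketch under assumptions: it clusters Whisper encoder activations with k-means to form a codebook (the clustering method, codebook size, model names, and vocabulary offset are all illustrative), then maps speech frames to token ids offset past a text vocabulary so speech and text can share one token space.

```python
# Minimal sketch (not the paper's implementation): build a discrete speech
# codebook by k-means clustering of Whisper encoder activations, then map
# speech frames to codebook indices that can be interleaved with text tokens
# for a decoder-only LM. Codebook size, file paths, and the text-vocabulary
# offset are illustrative assumptions.
import torch
import whisper                      # openai-whisper
from sklearn.cluster import KMeans

CODEBOOK_SIZE = 1024                # assumed codebook size
wmodel = whisper.load_model("base")

def encoder_activations(path: str) -> torch.Tensor:
    """Return Whisper encoder activations (frames x dims) for one audio file."""
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).to(wmodel.device)
    with torch.no_grad():
        feats = wmodel.encoder(mel.unsqueeze(0))   # (1, frames, dims)
    return feats.squeeze(0).cpu()

# 1) Fit the codebook on a sample of training audio (paths are placeholders).
train_feats = torch.cat([encoder_activations(p) for p in ["a.wav", "b.wav"]])
codebook = KMeans(n_clusters=CODEBOOK_SIZE).fit(train_feats.numpy())

# 2) Discretize new speech into token ids, offset past the text vocabulary
#    so speech and text share a single token space for the decoder-only model.
TEXT_VOCAB_SIZE = 50257             # e.g., GPT-2 vocabulary size (assumption)
def speech_to_tokens(path: str) -> list[int]:
    ids = codebook.predict(encoder_activations(path).numpy())
    return [TEXT_VOCAB_SIZE + int(i) for i in ids]
```

In a setup like this, the resulting speech token ids could be appended to a pretrained LLM's vocabulary (with the embedding table enlarged accordingly), which is one plausible way to initialize a multimodal decoder-only model from a text-only LLM as the abstract describes.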