LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT (2310.04673v4)

Published 7 Oct 2023 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text LLMs. Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared to models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable or superior performance relative to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.
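
The abstract's central architectural idea, feeding the GPT backbone continuous audio-encoder features on the input side while predicting discrete codec tokens alongside text tokens on the output side, can be illustrated with a short PyTorch sketch. This is a minimal illustrative sketch, not the authors' implementation: the module name LauraGPTSketch, all dimensions, layer counts, and vocabulary sizes are assumptions, and the paper's one-step codec vocoder that reconstructs waveforms from predicted codec tokens is omitted.

# Minimal sketch (assumption, not the authors' code) of LauraGPT's hybrid
# audio-and-text representation: continuous features in, discrete codec
# tokens out. All sizes below are illustrative placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000     # hypothetical text (BPE) vocabulary size
N_CODEC = 1024         # hypothetical codec codebook size
D_MODEL = 512
AUDIO_DIM = 80         # e.g. log-Mel filterbank features

class LauraGPTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Continuous input path: an audio encoder projects acoustic frames
        # into the same embedding space as text tokens.
        self.audio_proj = nn.Linear(AUDIO_DIM, D_MODEL)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Discrete output path: one embedding table and one output head
        # cover the text vocabulary plus the codec codebook, so a single
        # autoregressive decoder can emit either modality.
        self.token_emb = nn.Embedding(TEXT_VOCAB + N_CODEC, D_MODEL)
        self.backbone = nn.TransformerEncoder(   # stand-in for the GPT decoder;
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=4,
        )                                         # causal masking omitted for brevity
        self.lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + N_CODEC)

    def forward(self, audio_feats, prompt_ids):
        # Continuous audio embeddings and discrete prompt embeddings are
        # concatenated into one sequence for the backbone.
        audio_emb = self.audio_encoder(self.audio_proj(audio_feats))  # (B, T_a, D)
        prompt_emb = self.token_emb(prompt_ids)                       # (B, T_p, D)
        h = self.backbone(torch.cat([audio_emb, prompt_emb], dim=1))
        return self.lm_head(h)  # logits over text tokens and codec tokens

model = LauraGPTSketch()
logits = model(torch.randn(1, 100, AUDIO_DIM), torch.randint(0, TEXT_VOCAB, (1, 8)))
print(logits.shape)  # torch.Size([1, 108, 33024])

The shared output head over the concatenated text-plus-codec vocabulary is what lets one decoder generate either modality, which matches the abstract's combination of continuous input features with discrete codec outputs.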

Authors (15)
  1. Jiaming Wang (37 papers)
  2. Zhihao Du (30 papers)
  3. Qian Chen (264 papers)
  4. Yunfei Chu (15 papers)
  5. Zhifu Gao (28 papers)
  6. Zerui Li (9 papers)
  7. Kai Hu (55 papers)
  8. Xiaohuan Zhou (13 papers)
  9. Jin Xu (131 papers)
  10. Ziyang Ma (73 papers)
  11. Wen Wang (144 papers)
  12. Siqi Zheng (61 papers)
  13. Chang Zhou (105 papers)
  14. Zhijie Yan (33 papers)
  15. Shiliang Zhang (132 papers)
Citations (66)