LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT (2310.04673v4)
Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves performance comparable or superior to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding.
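The abstract's key design choice is a hybrid audio representation: continuous features on the input side, discrete codec tokens on the output side. The minimal PyTorch sketch below illustrates that data flow only; it is not the authors' implementation, and every module choice, dimension, and name is a hypothetical placeholder (causal masking on the backbone is also omitted for brevity).

```python
# Hypothetical sketch of the hybrid audio representation described in the
# abstract: continuous embeddings for input audio, discrete codec-token
# logits for output audio. All sizes and module choices are placeholders.
import torch
import torch.nn as nn

class LauraGPTSketch(nn.Module):
    def __init__(self, d_model=256, n_codec_tokens=1024, vocab_size=32000):
        super().__init__()
        # Audio encoder: maps input audio features to continuous
        # representations (no quantization on the input side).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # GPT-style backbone over the mixed audio/text sequence
        # (simplified here: causal masking is not applied).
        bb_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(bb_layer, num_layers=4)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Two output heads: text tokens, and discrete codec tokens for audio.
        self.text_head = nn.Linear(d_model, vocab_size)
        self.codec_head = nn.Linear(d_model, n_codec_tokens)

    def forward(self, audio_feats, text_ids):
        # Continuous representations for the input audio.
        audio_emb = self.audio_encoder(audio_feats)
        text_emb = self.text_embed(text_ids)
        # Concatenate both modalities into one sequence for the backbone.
        h = self.backbone(torch.cat([audio_emb, text_emb], dim=1))
        # Codec-token logits; in the full system a one-step codec vocoder
        # would turn the predicted tokens into a waveform.
        return self.text_head(h), self.codec_head(h)

model = LauraGPTSketch()
audio = torch.randn(1, 50, 256)           # (batch, frames, feature dim)
text = torch.randint(0, 32000, (1, 10))   # (batch, text tokens)
text_logits, codec_logits = model(audio, text)
print(text_logits.shape, codec_logits.shape)
```

Per the abstract, the final waveform is produced by the proposed one-step codec vocoder from the predicted codec tokens, which sidesteps the difficulty of autoregressively modeling their multimodal distribution.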
Authors: Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang