Sparks of Large Audio Models: A Survey and Outlook (2308.12792v3)

Published 24 Aug 2023 in cs.SD and eess.AS

Abstract: This survey provides a comprehensive overview of recent advancements and challenges in applying LLMs to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources, from human voices to musical instruments and environmental sounds, poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, ranging from Automatic Speech Recognition and Text-to-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently begun to act as universal translators, supporting multiple speech tasks for up to 100 languages without relying on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions for Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development in this area, we will continually update a repository of recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.

Overview of Large Audio Models: A Survey and Outlook

The paper, "Sparks of Large Audio Models: A Survey and Outlook," provides an in-depth exploration of recent advancements in applying LLMs to audio signal processing. These models, driven primarily by transformer architectures, have demonstrated impressive capabilities across various audio tasks including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Music Generation.

Key Contributions

  1. State-of-the-Art Audio Models: The paper discusses foundational audio models such as SeamlessM4T, which function as universal translators capable of performing multiple speech tasks across up to 100 languages without separate task-specific systems (a minimal usage sketch follows this list). This represents a significant leap in multimodal integration within AI systems.
  2. Performance Benchmarks and Methodologies: The authors analyze state-of-the-art methods alongside the benchmarks used to evaluate them, and assess how well these large audio models carry over to real-world scenarios, highlighting their scalability and versatility.
  3. Identification of Challenges: Current limitations include handling diverse signal representations, managing data variability, and ensuring model robustness across different audio sources. The paper also addresses the challenges of integrating these models into real-world applications.
  4. Future Directions: The authors provide insights into potential research avenues to enhance Large Audio Models, fostering innovation and addressing existing challenges. These include improving data handling, refining transformer architectures, and better integration across audio and language tasks.
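
To make the "one model, many tasks" property concrete, here is a minimal sketch of driving a SeamlessM4T checkpoint through the Hugging Face transformers integration. The class name, checkpoint name, and generation flags below come from that library rather than from the paper, so treat them as assumptions for illustration only.

```python
# Minimal sketch (assumptions: Hugging Face `transformers` provides the
# SeamlessM4TModel class and the "facebook/hf-seamless-m4t-medium" checkpoint;
# neither is prescribed by the survey itself).
import torch
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Text-to-text translation (English -> French). The same processor/model pair
# also accepts audio inputs, e.g. processor(audios=waveform, return_tensors="pt"),
# covering speech translation and recognition without a separate system.
inputs = processor(text="Large audio models are versatile.", src_lang="eng",
                   return_tensors="pt")
with torch.no_grad():
    tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```

Switching between text and speech tasks then amounts to changing the inputs and the `generate_speech` flag rather than swapping in a different task-specific model, which is precisely the property the survey highlights.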

Implications and Future Outlook

The integration of LLMs into audio processing marks a new frontier in AI, with profound implications for industries that rely on speech and music technologies. Handling a wide variety of audio tasks with a single foundational model reduces the complexity of maintaining multiple task-specific systems, streamlining pipelines and improving efficiency.

Practical Implications:

  • ASR and TTS: Robust models like SeamlessM4T can substantially improve voice assistants, transcription services, and real-time translation devices; the paper suggests these models could significantly impact sectors such as telecommunications, healthcare, and virtual assistance (a brief transcription sketch follows this list).
  • Music and Sound Generation: With models like AudioLM and MusicGen, AI-driven creativity in music production could be transformative, opening new pathways in the digital media and entertainment industries.
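
As a concrete illustration of the first bullet, the sketch below transcribes a recording with Whisper, one of the foundational ASR models the survey covers, via the Hugging Face transformers pipeline API. The pipeline usage and checkpoint name are assumptions of this example rather than something prescribed by the paper, and "meeting.wav" is a placeholder path.

```python
# Minimal ASR sketch (assumptions: the `transformers` ASR pipeline and the
# "openai/whisper-small" checkpoint; "meeting.wav" is a placeholder file).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # multilingual Whisper checkpoint
    chunk_length_s=30,             # chunk long recordings for transcription
)

result = asr("meeting.wav")
print(result["text"])
```

For the second bullet, text-conditioned music generation with a MusicGen checkpoint follows a similar single-model pattern (again assuming the transformers integration; class and checkpoint names are illustrative):

```python
# Minimal music-generation sketch (assumptions as above; exact argument names
# and output quality depend on the library version and checkpoint).
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi piano with a slow beat"], padding=True,
                   return_tensors="pt")
audio = model.generate(**inputs, max_new_tokens=256)  # roughly 5 s of audio
rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("clip.wav", rate=rate, data=audio[0, 0].numpy())
```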

Theoretical Implications:

  • Multimodal Learning: As these models continue to advance, the theoretical understanding of multimodal interactions between text and audio will deepen, potentially leading to more holistic AI systems capable of seamless cross-modal understanding and generation.
  • Language and Audio Interaction: Understanding the nuanced interactions between language and audio signals can lead to enhanced generalization capabilities across different AI domains, pushing the boundaries of existing models.

Speculation on Future Developments:

Future developments could see even greater integration of AI across modalities, leading to systems that not only understand but also generate nuanced multimedia content. As foundational models become more sophisticated, their emergent abilities could introduce breakthroughs in artificial general intelligence.

Conclusion

The paper underscores the transformative potential of Large Audio Models in redefining audio signal processing, offering a comprehensive overview of current methodologies, challenges, and future directions. It serves as a valuable resource for researchers aiming to navigate and contribute to this rapidly evolving landscape, and it emphasizes continual updates and community engagement as drivers of innovation.

  262. N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning.   PMLR, 2022, pp. 5547–5569.
  263. G. Park, B. Park, S. J. Kwon, B. Kim, Y. Lee, and D. Lee, “nuqmm: Quantized matmul for efficient inference of large-scale generative language models,” arXiv preprint arXiv:2206.09557, 2022.
  264. R.-J. Zhu, Q. Zhao, and J. K. Eshraghian, “Spikegpt: Generative pre-trained language model with spiking neural networks,” arXiv preprint arXiv:2302.13939, 2023.
  265. M. C. Rillig, M. Ågerstrand, M. Bi, K. A. Gould, and U. Sauerland, “Risks and benefits of large language models for the environment,” Environmental Science & Technology, vol. 57, no. 9, pp. 3464–3466, 2023.
  266. Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and S. Poria, “Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models,” arXiv preprint arXiv:2304.01933, 2023.
  267. T. Susnjak, “Prisma-dfllm: An extension of prisma for systematic literature reviews using domain-specific finetuned large language models,” arXiv preprint arXiv:2306.14905, 2023.
  268. A. Chavan, Z. Liu, D. Gupta, E. Xing, and Z. Shen, “One-for-all: Generalized lora for parameter-efficient fine-tuning,” arXiv preprint arXiv:2306.07967, 2023.
  269. W.-C. Huang, C.-H. Wu, S.-B. Luo, K.-Y. Chen, H.-M. Wang, and T. Toda, “Speech recognition by simply fine-tuning bert,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 7343–7347.
  270. Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” arXiv preprint arXiv:2306.16007, 2023.
  271. S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, “Survey of deep representation learning for speech emotion recognition,” IEEE Transactions on Affective Computing, 2021.
  272. J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y. Zemlyanskiy, D. Uthus, M. Guo, J. Lee-Thorp, Y. Tay et al., “Colt5: Faster long-range transformers with conditional computation,” arXiv preprint arXiv:2303.09752, 2023.
  273. J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, and F. Wei, “Longnet: Scaling transformers to 1,000,000,000 tokens,” arXiv preprint arXiv:2307.02486, 2023.
  274. J. Kaddour, O. Key, P. Nawrot, P. Minervini, and M. J. Kusner, “No train no gain: Revisiting efficient training algorithms for transformer-based language models,” arXiv preprint arXiv:2307.06440, 2023.
  275. S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, H. Michalewski, and P. Miłoś, “Focused transformer: Contrastive training for context scaling,” arXiv preprint arXiv:2307.03170, 2023.
  276. A. Haviv, O. Ram, O. Press, P. Izsak, and O. Levy, “Transformer language models without positional encodings still learn positional information,” arXiv preprint arXiv:2203.16634, 2022.
  277. B. Liu, J. T. Ash, S. Goel, A. Krishnamurthy, and C. Zhang, “Transformers learn shortcuts to automata,” arXiv preprint arXiv:2210.10749, 2022.
  278. J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” arXiv preprint arXiv:2104.09864, 2021.
  279. J. Geiping and T. Goldstein, “Cramming: Training a language model on a single gpu in one day.” in International Conference on Machine Learning.   PMLR, 2023, pp. 11 117–11 143.
  280. Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023.
  281. T. Dao, D. Y. Fu, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models,” arXiv preprint arXiv:2212.14052, 2022.
  282. A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, “Resurrecting recurrent neural networks for long sequences,” arXiv preprint arXiv:2303.06349, 2023.
  283. B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023.
  284. S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. W. Schuller, “Deep architecture enhancing robustness to noise, adversarial attacks, and cross-corpus setting for speech emotion recognition,” Proc. Interspeech 2020, pp. 2327–2331, 2020.
  285. M. Cascella, J. Montomoli, V. Bellini, and E. Bignami, “Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios,” Journal of Medical Systems, vol. 47, no. 1, p. 33, 2023.
  286. S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
  287. J. Cui, Z. Li, Y. Yan, B. Chen, and L. Yuan, “Chatlaw: Open-source legal large language model with integrated external knowledge bases,” arXiv preprint arXiv:2306.16092, 2023.
  288. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
  289. V. Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.
  290. P. Manakul, A. Liusie, and M. J. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” arXiv preprint arXiv:2303.08896, 2023.
  291. P. Feldman, J. R. Foulds, and S. Pan, “Trapping llm hallucinations using tagged context prompts,” arXiv preprint arXiv:2306.06085, 2023.
  292. N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, and M. Steedman, “Sources of hallucination by large language models on inference tasks,” arXiv preprint arXiv:2305.14552, 2023.
  293. W. Sun, Z. Shi, S. Gao, P. Ren, M. de Rijke, and Z. Ren, “Contrastive learning reduces hallucination in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 618–13 626.
  294. S. Latif, H. S. Ali, M. Usama, R. Rana, B. Schuller, and J. Qadir, “Ai-based emotion recognition: Promise, peril, and prescriptions for prosocial path,” arXiv preprint arXiv:2211.07290, 2022.
  295. P. P. Ray, “Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope,” Internet of Things and Cyber-Physical Systems, 2023.
Authors (11)
  1. Siddique Latif
  2. Moazzam Shoukat
  3. Fahad Shamshad
  4. Muhammad Usama
  5. Yi Ren
  6. Wenwu Wang
  7. Xulong Zhang
  8. Roberto Togneri
  9. Erik Cambria
  10. Björn W. Schuller
  11. Heriberto Cuayáhuitl