
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4 (2310.12321v1)

Published 4 Oct 2023 in cs.CL

Abstract: Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities that allow them to achieve remarkable performance in many natural language processing tasks without any task-specific training. The era of LLMs started with OpenAI's GPT-3 model, and the popularity of LLMs has increased rapidly since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey that summarizes recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey with foundational concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, their robustness, their effectiveness as evaluators, and finally conclude with multiple insightful future research directions. To summarize, this comprehensive survey will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models.
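The abstract's central claim, that GLLMs perform tasks without any task-specific training, refers to in-context (few-shot) prompting: labeled demonstrations are placed directly in the model's input, and the model completes the pattern with no gradient updates. A minimal sketch of how such a prompt is assembled (the instruction wording, `Review:`/`Sentiment:` format, and labels are illustrative assumptions, not taken from the paper):

```python
def build_few_shot_prompt(demos, query,
                          instruction="Classify the sentiment as positive or negative."):
    """Assemble an in-context learning prompt: an instruction, then
    labeled demonstrations, then the unlabeled query for the model to
    complete. No parameters are updated; the 'learning' happens purely
    through conditioning on the prompt text."""
    lines = [instruction, ""]
    for text, label in demos:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model is expected to fill in the label here
    return "\n".join(lines)

demos = [
    ("A wonderful, heartfelt film.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
]
prompt = build_few_shot_prompt(demos, "Surprisingly moving and well paced.")
print(prompt)
```

Sending this string to a GLLM (e.g. via a completion API) and reading the first generated token yields a zero-training classifier; the zero-shot variant simply drops the `demos` list.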

  155. X. Xu, Y. Zhu, X. Wang, and N. Zhang, “How to unleash the power of large language models for few-shot relation extraction?” arXiv preprint arXiv:2305.01555, 2023.
  156. Z. Wan, F. Cheng, Z. Mao, Q. Liu, H. Song, J. Li, and S. Kurohashi, “Gpt-re: In-context learning for relation extraction using large language models,” arXiv preprint arXiv:2305.02105, 2023.
  157. C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is chatgpt a general-purpose natural language processing task solver?” arXiv preprint arXiv:2302.06476, 2023.
  158. Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language model is not a good few-shot information extractor, but a good reranker for hard samples!” arXiv preprint arXiv:2303.08559, 2023.
  159. S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang, “Gpt-ner: Named entity recognition via large language models,” arXiv preprint arXiv:2304.10428, 2023.
  160. D. Stammbach, M. Antoniak, and E. Ash, “Heroes, villains, and victims, and gpt-3: Automated extraction of character roles without training data,” in Proceedings of the 4th Workshop of Narrative Understanding (WNU2022), 2022, pp. 47–56.
  161. S. Wadhwa, S. Amir, and B. C. Wallace, “Revisiting relation extraction in the era of large language models,” arXiv preprint arXiv:2305.05003, 2023.
  162. P. Li, T. Sun, Q. Tang, H. Yan, Y. Wu, X. Huang, and X. Qiu, “Codeie: Large code generation models are better few-shot information extractors,” arXiv preprint arXiv:2305.05711, 2023.
  163. K. Zhang, B. J. Gutiérrez, and Y. Su, “Aligning instruction tasks unlocks large language models as zero-shot relation extractors,” arXiv preprint arXiv:2305.11159, 2023.
  164. Y. Lu, Q. Liu, D. Dai, X. Xiao, H. Lin, X. Han, L. Sun, and H. Wu, “Unified structure generation for universal information extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5755–5772.
  165. Y. Chen, J. Cheng, H. Jiang, L. Liu, H. Zhang, S. Shi, and R. Xu, “Learning from sibling mentions with scalable graph inference in fine-grained entity typing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2076–2087.
  166. S. S. S. Das, A. Katiyar, R. J. Passonneau, and R. Zhang, “Container: Few-shot named entity recognition via contrastive learning,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6338–6353.
  167. S. Wu and Y. He, “Enriching pre-trained language model with entity information for relation classification,” in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2361–2364.
  168. D. Ye, Y. Lin, P. Li, and M. Sun, “Packed levitated marker for entity and relation extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4904–4917.
  169. K. Zhao, X. Jin, L. Bai, J. Guo, and X. Cheng, “Knowledge-enhanced self-supervised prototypical network for few-shot event detection,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 6266–6275.
  170. Y. Ma, Z. Wang, Y. Cao, M. Li, M. Chen, K. Wang, and J. Shao, “Prompt for extraction? paie: Prompting argument interaction for event argument extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6759–6774.
  171. X. Du and C. Cardie, “Event extraction by answering (almost) natural questions,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 671–683.
  172. Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12697–12706.
  173. Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109.
  174. M. Zaib, W. E. Zhang, Q. Z. Sheng, A. Mahmood, and Y. Zhang, “Conversational question answering: A survey,” Knowledge and Information Systems, vol. 64, no. 12, pp. 3151–3195, 2022.
  175. Y. Chali, S. A. Hasan, and S. R. Joty, “Improving graph-based random walks for complex question answering using syntactic, shallow semantic and extended string subsequence kernels,” Information Processing & Management, vol. 47, no. 6, pp. 843–855, 2011.
  176. A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavaf, and E. A. Fox, “Natural language processing advancements by deep learning: A survey,” arXiv preprint arXiv:2003.01200, 2020.
  177. D. Nunes, R. Primi, R. Pires, R. Lotufo, and R. Nogueira, “Evaluating gpt-3.5 and gpt-4 models on brazilian university admission exams,” arXiv preprint arXiv:2303.17003, 2023.
  178. Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, “Evaluation of chatgpt as a question answering system for answering complex questions,” arXiv preprint arXiv:2303.07992, 2023.
  179. Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3081–3089.
  180. P. Srivastava, T. Ganu, and S. Guha, “Towards zero-shot and few-shot table question answering using gpt-3,” arXiv preprint arXiv:2210.17284, 2022.
  181. S. Zheng, J. Huang, and K. C.-C. Chang, “Why does chatgpt fall short in answering questions faithfully?” arXiv preprint arXiv:2304.10513, 2023.
  182. J. S. Samaan, Y. H. Yeo, N. Rajeev, L. Hawley, S. Abel, W. H. Ng, N. Srinivasan, J. Park, M. Burch, R. Watson et al., “Assessing the accuracy of responses by the language model chatgpt to questions regarding bariatric surgery,” Obesity Surgery, pp. 1–7, 2023.
  183. J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen et al., “Evaluating large language models on a highly-specialized topic, radiation oncology physics,” Frontiers in Oncology, vol. 13, p. 1219326, 2023.
  184. I. Joshi, R. Budhiraja, H. Dev, J. Kadia, M. O. Ataullah, S. Mitra, D. Kumar, and H. D. Akolekar, “Chatgpt–a blessing or a curse for undergraduate computer science students and instructors?” arXiv preprint arXiv:2304.14993, 2023.
  185. H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” arXiv preprint arXiv:2303.13375, 2023.
  186. A. Hamidi and K. Roberts, “Evaluation of ai chatbots for patient-specific ehr questions,” arXiv preprint arXiv:2306.02549, 2023.
  187. J. Savelka, A. Agarwal, C. Bogart, and M. Sakr, “Large language models (gpt) struggle to answer multiple-choice questions about code,” arXiv preprint arXiv:2303.08033, 2023.
  188. M. Bommarito II and D. M. Katz, “Gpt takes the bar exam,” arXiv preprint arXiv:2212.14402, 2022.
  189. J. Pereira, R. Fidalgo, R. Lotufo, and R. Nogueira, “Visconde: Multi-document qa with gpt-3 and neural reranking,” in European Conference on Information Retrieval.   Springer, 2023, pp. 534–543.
  190. R. Gupta, I. Herzog, J. B. Park, J. Weisberger, P. Firouzbakht, V. Ocon, J. Chao, E. S. Lee, and B. A. Mailey, “Performance of chatgpt on the plastic surgery inservice training examination,” Aesthetic Surgery Journal, p. sjad128, 2023.
  191. Y. Tanaka, T. Nakata, K. Aiga, T. Etani, R. Muramatsu, S. Katagiri, H. Kawai, F. Higashino, M. Enomoto, M. Noda et al., “Performance of generative pretrained transformer on the national medical licensing examination in japan,” medRxiv preprint, 2023.
  192. J. Robinson and D. Wingate, “Leveraging large language models for multiple choice question answering,” in The Eleventh International Conference on Learning Representations, 2022.
  193. Y. Weng, B. Li, F. Xia, M. Zhu, B. Sun, S. He, K. Liu, and J. Zhao, “Large language models need holistically thought in medical conversational qa,” arXiv preprint arXiv:2305.05410, 2023.
  194. S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3214–3252.
  195. J. Kasai, Y. Kasai, K. Sakaguchi, Y. Yamada, and D. Radev, “Evaluating gpt-4 and chatgpt on japanese medical licensing examinations,” arXiv preprint arXiv:2303.18027, 2023.
  196. W. Gu, “Linguistically informed chatgpt prompts to enhance japanese-chinese machine translation: A case study on attributive clauses,” arXiv preprint arXiv:2303.15587, 2023.
  197. K. Peng, L. Ding, Q. Zhong, L. Shen, X. Liu, M. Zhang, Y. Ouyang, and D. Tao, “Towards making the most of chatgpt for machine translation,” arXiv preprint arXiv:2303.13780, 2023.
  198. W. Jiao, W. Wang, J. Huang, X. Wang, and Z. Tu, “Is chatgpt a good translator? yes with gpt-4 as the engine,” arXiv preprint arXiv:2301.08745, 2023.
  199. A. Hendy, M. Abdelrehim, A. Sharaf, V. Raunak, M. Gabr, H. Matsushita, Y. J. Kim, M. Afify, and H. H. Awadalla, “How good are gpt models at machine translation? a comprehensive evaluation,” arXiv preprint arXiv:2302.09210, 2023.
  200. Y. Gao, R. Wang, and F. Hou, “How to design translation prompts for chatgpt: An empirical study,” arXiv preprint, 2023.
  201. L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu, “Document-level machine translation with large language models,” arXiv preprint arXiv:2304.02210, 2023.
  202. W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, “Multilingual machine translation with large language models: Empirical results and analysis,” arXiv preprint arXiv:2304.04675, 2023.
  203. C. Lyu, J. Xu, and L. Wang, “New trends in machine translation using large language models: Case examples with chatgpt,” arXiv preprint arXiv:2305.01181, 2023.
  204. M. Karpinska and M. Iyyer, “Large language models effectively leverage document-level context for literary translation, but critical errors persist,” arXiv preprint arXiv:2304.03245, 2023.
  205. Y. Moslem, R. Haque, and A. Way, “Adaptive machine translation with large language models,” arXiv preprint arXiv:2301.13294, 2023.
  206. Z. He, T. Liang, W. Jiao, Z. Zhang, Y. Yang, R. Wang, Z. Tu, S. Shi, and X. Wang, “Exploring human-like translation strategy with large language models,” arXiv preprint arXiv:2305.04118, 2023.
  207. V. Raunak, A. Sharaf, H. H. Awadallah, and A. Menezes, “Leveraging gpt-4 for automatic translation post-editing,” arXiv preprint arXiv:2305.14878, 2023.
  208. V. Raunak, A. Menezes, M. Post, and H. H. Awadallah, “Do gpts produce less literal translations?” arXiv preprint arXiv:2305.16806, 2023.
  209. F. Stahlberg, “Neural machine translation: A review,” Journal of Artificial Intelligence Research, vol. 69, pp. 343–418, 2020.
  210. S. Yang, Y. Wang, and X. Chu, “A survey of deep learning techniques for neural machine translation,” arXiv preprint arXiv:2002.07526, 2020.
  211. Z. Tan, S. Wang, Z. Yang, G. Chen, X. Huang, M. Sun, and Y. Liu, “Neural machine translation: A review of methods, resources, and tools,” AI Open, vol. 1, pp. 5–21, 2020.
  212. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  213. Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, and A. Fan, “Multilingual translation with extensible multilingual pretraining and finetuning,” arXiv preprint arXiv:2008.00401, 2020.
  214. A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Çelebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin, “Beyond english-centric multilingual machine translation,” arXiv preprint arXiv:2010.11125, 2020.
  215. M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022.
  216. R. Martínez-Cruz, A. J. López-López, and J. Portela, “Chatgpt vs state-of-the-art models: A benchmarking study in keyphrase generation task,” arXiv preprint arXiv:2304.14177, 2023.
  217. M. Song, H. Jiang, S. Shi, S. Yao, S. Lu, Y. Feng, H. Liu, and L. Jing, “Is chatgpt a good keyphrase generator? a preliminary study,” arXiv preprint arXiv:2303.13001, 2023.
  218. W. Pan, Q. Chen, X. Xu, W. Che, and L. Qin, “A preliminary evaluation of chatgpt for zero-shot dialogue understanding,” arXiv preprint arXiv:2304.04256, 2023.
  219. W. Zhao, Y. Zhao, X. Lu, S. Wang, Y. Tong, and B. Qin, “Is chatgpt equipped with emotional dialogue capabilities?” arXiv preprint arXiv:2304.09582, 2023.
  220. B. Chintagunta, N. Katariya, X. Amatriain, and A. Kannan, “Medically aware gpt-3 as a data generator for medical dialogue summarization,” in Machine Learning for Healthcare Conference.   PMLR, 2021, pp. 354–372.
  221. G. P. Prodan and E. Pelican, “Prompt scoring system for dialogue summarization using gpt-3,” ACM Transactions on Audio, Speech, and Language Processing, pp. 1–9, 2022.
  222. J. Huynh, C. Jiao, P. Gupta, S. Mehri, P. Bajaj, V. Chaudhary, and M. Eskenazi, “Understanding the effectiveness of very large language models on dialog evaluation,” arXiv preprint arXiv:2301.12004, 2023.
  223. Y. Fan and F. Jiang, “Uncovering the potential of chatgpt for discourse analysis in dialogue: An empirical study,” arXiv preprint arXiv:2305.08391, 2023.
  224. H. Wang, R. Wang, F. Mi, Z. Wang, R. Xu, and K.-F. Wong, “Chain-of-thought prompting for responding to in-depth dialogue questions with llm,” arXiv preprint arXiv:2305.11792, 2023.
  225. R. Meng, X. Yuan, T. Wang, S. Zhao, A. Trischler, and D. He, “An empirical study on neural keyphrase generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4985–5007.
  226. X. Yuan, T. Wang, R. Meng, K. Thaker, P. Brusilovsky, D. He, and A. Trischler, “One size does not fit all: Generating and evaluating variable number of keyphrases,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7961–7975.
  227. M. Kulkarni, D. Mahata, R. Arora, and R. Bhowmik, “Learning rich representation of keyphrases from text,” in Findings of the Association for Computational Linguistics: NAACL 2022.   Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 891–906. [Online]. Available: https://aclanthology.org/2022.findings-naacl.67
  228. I. V. Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau, “A survey of available corpora for building data-driven dialogue systems: The journal version,” Dialogue & Discourse, vol. 9, no. 1, pp. 1–49, 2018.
  229. S. Larson and K. Leach, “A survey of intent classification and slot-filling datasets for task-oriented dialog,” arXiv preprint arXiv:2207.13211, 2022.
  230. W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, “Is chatgpt good at search? investigating large language models as re-ranking agent,” arXiv preprint arXiv:2304.09542, 2023.
  231. N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large language models are built-in autoregressive search engines,” arXiv preprint arXiv:2305.09612, 2023.
  232. A. Anand, L. Lyu, M. Idahl, Y. Wang, J. Wallat, and Z. Zhang, “Explainable information retrieval: A survey,” arXiv preprint arXiv:2211.02405, 2022.
  233. R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin, “Document ranking with a pretrained sequence-to-sequence model,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 708–718.
  234. N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych, “Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  235. W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense text retrieval based on pretrained language models: A survey,” arXiv preprint arXiv:2211.14876, 2022.
  236. G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions,” IEEE transactions on knowledge and data engineering, vol. 17, no. 6, pp. 734–749, 2005.
  237. Y. Peng, “A survey on modern recommendation system based on big data,” arXiv preprint arXiv:2206.02631, 2022.
  238. F. Rezaimehr and C. Dadkhah, “A survey of attack detection approaches in collaborative filtering recommender systems,” Artificial Intelligence Review, vol. 54, pp. 2011–2066, 2021.
  239. Y. Xie, J. Gao, P. Zhou, Q. Ye, Y. Hua, J. Kim, F. Wu, and S. Kim, “Rethinking multi-interest learning for candidate matching in recommender systems,” arXiv preprint arXiv:2302.14532, 2023.
  240. M. Dong, X. Zeng, L. Koehl, and J. Zhang, “An interactive knowledge-based recommender system for fashion product design in the big data environment,” Information Sciences, vol. 540, pp. 469–488, 2020.
  241. Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and J. Zhang, “Chat-rec: Towards interactive and explainable llms-augmented recommender system,” arXiv preprint arXiv:2303.14524, 2023.
  242. F. Zhu, Y. Wang, C. Chen, J. Zhou, L. Li, and G. Liu, “Cross-domain recommendation: challenges, progress, and prospects,” arXiv preprint arXiv:2103.01696, 2021.
  243. L. Wang and E.-P. Lim, “Zero-shot next-item recommendation using large pretrained language models,” arXiv preprint arXiv:2304.03153, 2023.
  244. A. Zhiyuli, Y. Chen, X. Zhang, and X. Liang, “Bookgpt: A general framework for book recommendation empowered by large language model,” arXiv preprint arXiv:2305.15673, 2023.
  245. J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is chatgpt a good recommender? a preliminary study,” arXiv preprint arXiv:2304.10149, 2023.
  246. S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun, X. Zhang, and J. Xu, “Uncovering chatgpt’s capabilities in recommender systems,” arXiv preprint arXiv:2305.02182, 2023.
  247. W.-C. Kang, J. Ni, N. Mehta, M. Sathiamoorthy, L. Hong, E. Chi, and D. Z. Cheng, “Do llms understand user preferences? evaluating llms on user rating prediction,” arXiv preprint arXiv:2305.06474, 2023.
  248. J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, and X. He, “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation,” arXiv preprint arXiv:2305.07609, 2023.
  249. Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. McAuley, and W. X. Zhao, “Large language models are zero-shot rankers for recommender systems,” arXiv preprint arXiv:2305.08845, 2023.
  250. S. Mysore, A. McCallum, and H. Zamani, “Large language model augmented narrative driven recommendations,” arXiv preprint arXiv:2306.02250, 2023.
  251. C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint arXiv:2304.00385, 2023.
  252. A. Cheshkov, P. Zadorozhny, and R. Levichev, “Evaluation of chatgpt model for vulnerability detection,” arXiv preprint arXiv:2304.07232, 2023.
  253. B. Yetiştiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt,” arXiv preprint arXiv:2304.10778, 2023.
  254. T.-O. Li, W. Zong, Y. Wang, H. Tian, Y. Wang, and S.-C. Cheung, “Finding failure-inducing test cases with chatgpt,” arXiv preprint arXiv:2304.11686, 2023.
  255. C. Liu, X. Bao, H. Zhang, N. Zhang, H. Hu, X. Zhang, and M. Yan, “Improving chatgpt prompt for code generation,” arXiv preprint arXiv:2305.08360, 2023.
  256. R. A. Poldrack, T. Lu, and G. Beguš, “Ai-assisted coding: Experiments with gpt-4,” arXiv preprint arXiv:2304.13187, 2023.
  257. J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” arXiv preprint arXiv:2305.01210, 2023.
  258. E. Chen, R. Huang, H.-S. Chen, Y.-H. Tseng, and L.-Y. Li, “Gptutor: a chatgpt-powered programming tool for code explanation,” arXiv preprint arXiv:2305.01863, 2023.
  259. N. Nascimento, P. Alencar, and D. Cowan, “Comparing software developers with chatgpt: An empirical investigation,” arXiv preprint arXiv:2305.11837, 2023.
  260. J. Y. Khan and G. Uddin, “Automatic code documentation generation using gpt-3,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–6.
  261. J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, J. Kim, A. Tran, and A. Hellas, “Comparing code explanations created by students and large language models,” arXiv preprint arXiv:2304.03938, 2023.
  262. X.-Y. Li, J.-T. Xue, Z. Xie, and M. Li, “Think outside the code: Brainstorming boosts large language models in code generation,” arXiv preprint arXiv:2305.10679, 2023.
  263. J. A. Prenner and R. Robbes, “Automatic program repair with openai’s codex: Evaluating quixbugs,” arXiv preprint arXiv:2111.03922, 2021.
  264. M. L. Siddiq, J. C. S. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V. C. Lopes, “Exploring the effectiveness of large language models in generating unit tests,” arXiv preprint arXiv:2305.00418, 2023.
  265. H. Tian, W. Lu, T. O. Li, X. Tang, S.-C. Cheung, J. Klein, and T. F. Bissyandé, “Is chatgpt the ultimate programming assistant–how far is it?” arXiv preprint arXiv:2304.11938, 2023.
  266. M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, and X. Liao, “An empirical study on using large language models for multi-intent comment generation,” arXiv preprint arXiv:2304.11384, 2023.
  267. S. Kang, B. Chen, S. Yoo, and J.-G. Lou, “Explainable automated debugging via large language model-driven scientific debugging,” arXiv preprint arXiv:2304.02195, 2023.
  268. A. Kashefi and T. Mukerji, “Chatgpt for programming numerical methods,” arXiv preprint arXiv:2303.12093, 2023.
  269. G. Destefanis, S. Bartolucci, and M. Ortu, “A preliminary analysis on the code generation capabilities of gpt-3.5 and bard ai models for java functions,” arXiv preprint arXiv:2305.09402, 2023.
  270. Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, “No more manual tests? evaluating and improving chatgpt for unit test generation,” arXiv preprint arXiv:2305.04207, 2023.
  271. T. Phung, V.-A. Padurean, J. P. Cambronero, S. Gulwani, T. Kohn, R. Majumdar, A. K. Singla, and G. Soares, “Generative ai for programming education: Benchmarking chatgpt, gpt-4, and human tutors,” arXiv preprint arXiv:2306.17156, 2023.
  272. X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” arXiv preprint arXiv:2308.10620, 2023.
  273. S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  274. L. Phan, H. Tran, D. Le, H. Nguyen, J. Annibal, A. Peltekian, and Y. Ye, “Cotext: Multi-task learning with code-text transformer,” in Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), 2021, pp. 40–47.
  275. D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, L. Shujie, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., “Graphcodebert: Pre-training code representations with data flow,” in International Conference on Learning Representations, 2020.
  276. W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2655–2668.
  277. D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen, and J.-G. Lou, “Cert: Continual pre-training on sketches for library-oriented code generation,” arXiv preprint arXiv:2206.06888, 2022.
  278. R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440.
  279. D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, “Quixbugs: A multi-lingual program repair benchmark set based on the quixey challenge,” in Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity, 2017, pp. 55–56.
  280. A. Sundar and L. Heck, “Multimodal conversational ai: A survey of datasets and approaches,” in Proceedings of the 4th Workshop on NLP for Conversational AI, 2022, pp. 131–147.
  281. P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  282. Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14974–14983.
  283. Y. Lin, Y. Xie, D. Chen, Y. Xu, C. Zhu, and L. Yuan, “Revive: Regional visual representation matters in knowledge-based visual question answering,” arXiv preprint arXiv:2206.01201, 2022.
  284. L. Gui, B. Wang, Q. Huang, A. G. Hauptmann, Y. Bisk, and J. Gao, “Kat: A knowledge augmented transformer for vision-and-language,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 956–968.
  285. Y. Lu, X. Yang, X. Li, X. E. Wang, and W. Y. Wang, “Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation,” arXiv preprint arXiv:2305.11116, 2023.
  286. W. Zhu, X. Wang, Y. Lu, T.-J. Fu, X. E. Wang, M. Eckstein, and W. Y. Wang, “Collaborative generative ai: Integrating gpt-k for efficient editing in text-to-image generation,” arXiv preprint arXiv:2305.11317, 2023.
  287. T. Zhang, Y. Zhang, V. Vineet, N. Joshi, and X. Wang, “Controllable text-to-image generation with gpt-4,” arXiv preprint arXiv:2305.18583, 2023.
  288. S. Hong, J. Seo, S. Hong, H. Shin, and S. Kim, “Large language models are frame-level directors for zero-shot text-to-video generation,” arXiv preprint arXiv:2305.14330, 2023.
  289. R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J. Huang, J. Liu et al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,” arXiv preprint arXiv:2304.12995, 2023.
  290. M. Ranjit, G. Ganapathy, R. Manuel, and T. Ganu, “Retrieval augmented chest x-ray report generation using openai gpt models,” arXiv preprint arXiv:2305.03660, 2023.
  291. S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action-gpt: Leveraging large-scale language models for improved and generalized zero shot action generation,” arXiv preprint arXiv:2211.15603, 2022.
  292. C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual chatgpt: Talking, drawing and editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023.
  293. Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang, “Mm-react: Prompting chatgpt for multimodal reasoning and action,” arXiv preprint arXiv:2303.11381, 2023.
  294. J. Li, H. Li, Z. Pan, and G. Pan, “Prompt chatgpt in mner: Improved multimodal named entity recognition method based on auxiliary refining knowledge from chatgpt,” arXiv preprint arXiv:2305.12212, 2023.
  295. S. Hakimov and D. Schlangen, “Images in language space: Exploring the suitability of large language models for vision & language tasks,” arXiv preprint arXiv:2305.13782, 2023.
  296. W. Feng, W. Zhu, T.-j. Fu, V. Jampani, A. Akula, X. He, S. Basu, X. E. Wang, and W. Y. Wang, “Layoutgpt: Compositional visual planning and generation with large language models,” arXiv preprint arXiv:2305.15393, 2023.
  297. L. Fan, D. Krishnan, P. Isola, D. Katabi, and Y. Tian, “Improving clip training with language rewrites,” arXiv preprint arXiv:2305.20088, 2023.
  298. C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” arXiv preprint arXiv:2306.00890, 2023.
  299. A. Bhattacharya, Y. K. Singla, B. Krishnamurthy, R. R. Shah, and C. Chen, “A video is worth 4096 tokens: Verbalize story videos to understand them in zero shot,” arXiv preprint arXiv:2305.09758, 2023.
  300. X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
  301. D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” arXiv preprint arXiv:2305.11000, 2023.
  302. Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, and J. Liu, “Chatbridge: Bridging modalities with large language model as a language catalyst,” arXiv preprint arXiv:2305.16103, 2023.
  303. M. Zheng, X. Su, S. You, F. Wang, C. Qian, C. Xu, and S. Albanie, “Can gpt-4 perform neural architecture search?” arXiv preprint arXiv:2304.10970, 2023.
  304. Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface,” arXiv preprint arXiv:2303.17580, 2023.
  305. L. Zhang, Y. Zhang, K. Ren, D. Li, and Y. Yang, “Mlcopilot: Unleashing the power of large language models in solving machine learning tasks,” arXiv preprint arXiv:2304.14979, 2023.
  306. S. Zhang, C. Gong, L. Wu, X. Liu, and M. Zhou, “Automl-gpt: Automatic machine learning with gpt,” arXiv preprint arXiv:2305.02499, 2023.
  307. A. Olmo, S. Sreedharan, and S. Kambhampati, “Gpt3-to-plan: Extracting plans from text using gpt-3,” arXiv preprint arXiv:2106.07131, 2021.
  308. B. Zhang and H. Soh, “Large language models as zero-shot human models for human-robot interaction,” arXiv preprint arXiv:2303.03548, 2023.
  309. Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh, “Translating natural language to planning goals with large-language models,” arXiv preprint arXiv:2302.05128, 2023.
310. H. Hu, H. Lu, H. Zhang, W. Lam, and Y. Zhang, “Chain-of-symbol prompting elicits planning in large language models,” arXiv preprint arXiv:2305.10276, 2023.
  311. K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati, “Large language models still can’t plan (a benchmark for llms on planning and reasoning about change),” in NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
  312. K. M. Collins, C. Wong, J. Feng, M. Wei, and J. B. Tenenbaum, “Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks,” arXiv preprint arXiv:2205.05718, 2022.
  313. K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, and E. Fedorenko, “Dissociating language and thought in large language models: a cognitive perspective,” arXiv preprint arXiv:2301.06627, 2023.
  314. K. S. Kalyan and S. Sangeetha, “Medical concept normalization in user-generated texts by learning target concept embeddings,” in Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020, pp. 18–23.
  315. ——, “Target concept guided medical concept normalization in noisy user-generated texts,” in Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2020, pp. 64–73.
  316. J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen et al., “Evaluating large language models on a highly-specialized topic, radiation oncology physics,” arXiv preprint arXiv:2304.01938, 2023.
  317. Z. Liu, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li et al., “Deid-gpt: Zero-shot medical text de-identification by gpt-4,” arXiv preprint arXiv:2303.11032, 2023.
  318. J. Giorgi, A. Toma, R. Xie, S. Chen, K. An, G. Zheng, and B. Wang, “Wanglab at mediqa-chat 2023: Clinical note generation from doctor-patient conversations using large language models,” in Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023, pp. 323–334.
  319. H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” ArXiv, vol. abs/2303.13375, 2023.
  320. Q. Chen, J. Du, Y. Hu, V. K. Keloth, X. Peng, K. Raja, R. Zhang, Z. Lu, and H. Xu, “Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations,” arXiv preprint arXiv:2305.16326, 2023.
  321. Y. Tanaka, T. Nakata, K. Aiga, T. Etani, R. Muramatsu, S. Katagiri, H. Kawai, F. Higashino, M. Enomoto, M. Noda, M. Kometani, M. Takamura, T. Yoneda, H. Kakizaki, and A. Nomura, “Performance of generative pretrained transformer on the national medical licensing examination in japan,” in medRxiv, 2023.
  322. J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu et al., “Benchmarking large language models on cmexam–a comprehensive chinese medical exam dataset,” arXiv preprint arXiv:2306.03030, 2023.
  323. Z. Yang, S. Cherian, and S. Vucetic, “Data augmentation for radiology report simplification,” in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1877–1887.
  324. C. Ma, Z. Wu, J. Wang, S. Xu, Y. Wei, Z. Liu, L. Guo, X. Cai, S. Zhang, T. Zhang et al., “Impressiongpt: an iterative optimizing framework for radiology report summarization with chatgpt,” arXiv preprint arXiv:2304.08448, 2023.
  325. M. Moradi, K. Blagec, F. Haberl, and M. Samwald, “Gpt-3 models are poor few-shot learners in the biomedical domain,” arXiv preprint arXiv:2109.02555, 2021.
  326. K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. Sabel, J. Ricke et al., “Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports,” arXiv preprint arXiv:2212.14882, 2022.
  327. X. Tang, A. Tran, J. Tan, and M. Gerstein, “Gersteinlab at mediqa-chat 2023: Clinical note summarization from doctor-patient conversations through fine-tuning and in-context learning,” arXiv preprint arXiv:2305.05001, 2023.
  328. M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 1998–2022.
  329. V. Nair, E. Schumacher, and A. Kannan, “Generating medically-accurate summaries of patient-provider dialogue: A multi-stage approach using large language models,” arXiv preprint arXiv:2305.05982, 2023.
  330. C. Shaib, M. L. Li, S. Joseph, I. J. Marshall, J. J. Li, and B. C. Wallace, “Summarizing, simplifying, and synthesizing medical evidence using gpt-3 (with varying success),” arXiv preprint arXiv:2305.06299, 2023.
  331. J. Xu, L. Lu, S. Yang, B. Liang, X. Peng, J. Pang, J. Ding, X. Shi, L. Yang, H. Song et al., “Medgpteval: A dataset and benchmark to evaluate responses of large language models in medicine,” arXiv preprint arXiv:2305.07340, 2023.
  332. X. Wang, Z. Gong, G. Wang, J. Jia, Y. Xu, J. Zhao, Q. Fan, S. Wu, W. Hu, and X. Li, “Chatgpt performs on the chinese national medical licensing examination,” 2023.
  333. K. A. Carpenter and R. B. Altman, “Using gpt-3 to build a lexicon of drugs of abuse synonyms for social media pharmacovigilance,” Biomolecules, vol. 13, no. 2, p. 387, 2023.
334. E. Hernandez, D. Mahajan, J. Wulff, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, E. Alsentzer et al., “Do we still need clinical language models?” in Conference on Health, Inference, and Learning. PMLR, 2023, pp. 578–597.
  335. A. S. Rao, M. Pang, J. Kim, M. Kamineni, W. Lie, A. K. Prasad, A. Landman, K. Dryer, and M. D. Succi, “Assessing the utility of chatgpt throughout the entire clinical workflow,” medRxiv, pp. 2023–02, 2023.
  336. T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo et al., “Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models,” PLoS digital health, vol. 2, no. 2, p. e0000198, 2023.
337. A. Hulman, O. L. Dollerup, J. F. Mortensen, M. Fenech, K. Norman, H. Stoevring, and T. K. Hansen, “Chatgpt versus human-generated answers to frequently asked questions about diabetes: a turing test-inspired survey among employees of a danish diabetes center,” medRxiv, pp. 2023–02, 2023.
  338. T. Hirosawa, Y. Harada, M. Yokose, T. Sakamoto, R. Kawamura, and T. Shimizu, “Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study,” International journal of environmental research and public health, vol. 20, no. 4, p. 3378, 2023.
  339. S. Liu, A. P. Wright, B. L. Patterson, J. P. Wanderer, R. W. Turer, S. D. Nelson, A. B. McCoy, D. F. Sittig, and A. Wright, “Assessing the value of chatgpt for clinical decision support optimization,” MedRxiv, pp. 2023–02, 2023.
  340. A. Gilson, C. W. Safranek, T. Huang, V. Socrates, L. Chi, R. A. Taylor, D. Chartash et al., “How does chatgpt perform on the united states medical licensing examination? the implications of large language models for medical education and knowledge assessment,” JMIR Medical Education, vol. 9, no. 1, p. e45312, 2023.
  341. F. Antaki, S. Touma, D. Milad, J. El-Khoury, and R. Duval, “Evaluating the performance of chatgpt in ophthalmology: An analysis of its successes and shortcomings,” Ophthalmology Science, p. 100324, 2023.
  342. Q. Lyu, J. Tan, M. E. Zapadka, J. Ponnatapura, C. Niu, K. J. Myers, G. Wang, and C. T. Whitlow, “Translating radiology reports into plain language using chatgpt and gpt-4 with prompt learning: results, limitations, and potential,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, no. 1, p. 9, 2023.
  343. F. Yu, L. Quartey, and F. Schilder, “Legal prompting: Teaching a language model to think like a lawyer,” arXiv preprint arXiv:2212.01326, 2022.
  344. H.-T. Nguyen, “A brief report on lawgpt 1.0: A virtual legal assistant based on gpt-3,” arXiv preprint arXiv:2302.05729, 2023.
  345. I. Chalkidis, “Chatgpt may pass the bar exam soon, but has a long way to go for the lexglue benchmark,” arXiv preprint arXiv:2304.12202, 2023.
  346. J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
  347. X. Cai, S. Liu, J. Han, L. Yang, Z. Liu, and T. Liu, “Chestxraybert: A pretrained language model for chest radiology report summarization,” IEEE Transactions on Multimedia, 2021.
  348. H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, Q. Wang, and D. Shen, “Doctorglm: Fine-tuning your chinese doctor is not a herculean task,” arXiv preprint arXiv:2304.01097, 2023.
  349. A. B. Abacha, W.-w. Yim, G. Adams, N. Snider, and M. Yetisgen-Yildiz, “Overview of the mediqa-chat 2023 shared tasks on the summarization & generation of doctor-patient conversations,” in Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023, pp. 503–513.
  350. H. Su, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, T. Yu et al., “One embedder, any task: Instruction-finetuned text embeddings,” arXiv preprint arXiv:2212.09741, 2022.
  351. Y. Lan, Y. Wu, W. Xu, W. Feng, and Y. Zhang, “Chinese fine-grained financial sentiment analysis with large language models,” arXiv preprint arXiv:2306.14096, 2023.
  352. G. Fatouros, J. Soldatos, K. Kouroumali, G. Makridis, and D. Kyriazis, “Transforming sentiment analysis in the financial domain with chatgpt,” arXiv preprint arXiv:2308.07935, 2023.
  353. M. Leippold, “Sentiment spin: Attacking financial sentiment with gpt-3,” Finance Research Letters, p. 103957, 2023.
  354. P. Wiriyathammabhum, “Promptshots at the finnlp-2022 erai task: Pairwise comparison and unsupervised ranking,” in Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), 2022, pp. 104–110.
  355. A. Shah and S. Chava, “Zero is not hero yet: Benchmarking zero-shot performance of llms for financial tasks,” arXiv preprint arXiv:2305.16633, 2023.
  356. L. Zhang, W. Cai, Z. Liu, Z. Yang, W. Dai, Y. Liao, Q. Qin, Y. Li, X. Liu, Z. Liu et al., “Fineval: A chinese financial domain knowledge evaluation benchmark for large language models,” arXiv preprint arXiv:2308.09975, 2023.
  357. P. K. Rajpoot and A. Parikh, “Gpt-finre: In-context learning for financial relation extraction using large language models,” arXiv preprint arXiv:2306.17519, 2023.
  358. L. Loukas, I. Stogiannidis, P. Malakasiotis, and S. Vassos, “Breaking the bank with chatgpt: Few-shot text classification for finance,” arXiv preprint arXiv:2308.14634, 2023.
  359. I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, and N. Aletras, “Lexglue: A benchmark dataset for legal language understanding in english,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4310–4330.
360. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
  361. Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. R. Routledge et al., “Finqa: A dataset of numerical reasoning over financial data,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3697–3711.
  362. V. D. Lai, N. T. Ngo, A. P. B. Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen, “Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning,” arXiv preprint arXiv:2304.05613, 2023.
  363. T. Fang, S. Yang, K. Lan, D. F. Wong, J. Hu, L. S. Chao, and Y. Zhang, “Is chatgpt a highly fluent grammatical error correction system? a comprehensive evaluation,” arXiv preprint arXiv:2304.01746, 2023.
  364. J. Armengol-Estapé, O. de Gibert Bonet, and M. Melero, “On the multilingual capabilities of very large-scale english language models,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3056–3068.
  365. K. Ahuja, R. Hada, M. Ochieng, P. Jain, H. Diddee, S. Maina, T. Ganu, S. Segal, M. Axmed, K. Bali et al., “Mega: Multilingual evaluation of generative ai,” arXiv preprint arXiv:2303.12528, 2023.
  366. X. Zhang, S. Li, B. Hauer, N. Shi, and G. Kondrak, “Don’t trust gpt when your question is not in english,” arXiv preprint arXiv:2305.16339, 2023.
  367. M. Das, S. K. Pandey, and A. Mukherjee, “Evaluating chatgpt’s performance for multilingual and emoji-based hate speech detection,” arXiv preprint arXiv:2305.13276, 2023.
  368. R. Hada, V. Gumma, A. de Wynter, H. Diddee, M. Ahmed, M. Choudhury, K. Bali, and S. Sitaram, “Are large language model-based evaluators the solution to scaling up multilingual evaluation?” arXiv preprint arXiv:2309.07462, 2023.
  369. W. Q. Leong, J. G. Ngui, Y. Susanto, H. Rengarajan, K. Sarveswaran, and W. C. Tjhi, “Bhasa: A holistic southeast asian linguistic and cultural evaluation suite for large language models,” arXiv preprint arXiv:2309.06085, 2023.
  370. R. Bommasani, P. Liang, and T. Lee, “Holistic evaluation of language models,” Annals of the New York Academy of Sciences, 2023.
  371. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Transactions on Machine Learning Research, 2023.
  372. F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd-workers for text-annotation tasks,” arXiv preprint arXiv:2303.15056, 2023.
  373. X. He, Z. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, W. Chen et al., “Annollm: Making large language models to be better crowdsourced annotators,” arXiv preprint arXiv:2303.16854, 2023.
  374. P. Törnberg, “Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning,” arXiv preprint arXiv:2304.06588, 2023.
  375. Y. Zhu, P. Zhang, E.-U. Haq, P. Hui, and G. Tyson, “Can chatgpt reproduce human-generated labels? a study of social computing tasks,” arXiv preprint arXiv:2304.10145, 2023.
376. L. Li, L. Fan, S. Atreja, and L. Hemphill, “‘Hot’ chatgpt: The promise of chatgpt in detecting and discriminating hateful, offensive, and toxic comments on social media,” arXiv preprint arXiv:2304.10619, 2023.
  377. Y. Gu, S. Zhang, N. Usuyama, Y. Woldesenbet, C. Wong, P. Sanapathi, M. Wei, N. Valluri, E. Strandberg, T. Naumann et al., “Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events,” arXiv preprint arXiv:2307.06439, 2023.
  378. S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? gpt-3 can help,” in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.
  379. B. Ding, C. Qin, L. Liu, L. Bing, S. Joty, and B. Li, “Is gpt-3 a good data annotator?” arXiv preprint arXiv:2212.10450, 2022.
  380. S. Meoni, E. De la Clergerie, and T. Ryffel, “Large language models as instructors: A study on multilingual clinical entity extraction,” in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023, pp. 178–190.
  381. Y. Xu, R. Xu, D. Iter, Y. Liu, S. Wang, C. Zhu, and M. Zeng, “Inheritsumm: A general, versatile and compact summarizer by distilling from gpt,” arXiv preprint arXiv:2305.13083, 2023.
  382. M. Alizadeh, M. Kubli, Z. Samei, S. Dehghani, J. D. Bermeo, M. Korobeynikova, and F. Gilardi, “Open-source large language models outperform crowd workers and approach chatgpt in text-annotation tasks,” arXiv preprint arXiv:2307.02179, 2023.
  383. S. Thapa, U. Naseem, and M. Nasim, “From humans to machines: can chatgpt-like llms effectively replace human annotators in nlp tasks,” in Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media, 2023.
384. J. S. Murthy, G. Siddesh, and K. Srinivasa, “Twitsenti: a real-time twitter sentiment analysis and visualization framework,” Journal of Information & Knowledge Management, vol. 18, no. 2, p. 1950013, 2019.
  385. W. Van Atteveldt, M. A. Van der Velden, and M. Boukes, “The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms,” Communication Methods and Measures, vol. 15, no. 2, pp. 121–140, 2021.
  386. M. Chmielewski and S. C. Kucker, “An mturk crisis? shifts in data quality and the impact on study results,” Social Psychological and Personality Science, vol. 11, no. 4, pp. 464–473, 2020.
  387. P. He, B. Peng, L. Lu, S. Wang, J. Mei, Y. Liu, R. Xu, H. H. Awadalla, Y. Shi, C. Zhu et al., “Z-code++: A pre-trained language model optimized for abstractive summarization,” arXiv preprint arXiv:2208.09770, 2022.
  388. H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  389. J. Cegin, J. Simko, and P. Brusilovsky, “Chatgpt to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness,” arXiv preprint arXiv:2305.12947, 2023.
  390. S. Oh, W. Jung et al., “Data augmentation for neural machine translation using generative language model,” arXiv preprint arXiv:2307.16833, 2023.
  391. S. Sharma, A. Joshi, N. Mukhija, Y. Zhao, H. Bhathena, P. Singh, S. Santhanam, and P. Biswas, “Systematic review of effect of data augmentation using paraphrasing on named entity recognition,” in NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  392. Z. Guo, P. Wang, Y. Wang, and S. Yu, “Dr. llama: Improving small language models in domain-specific qa via generative data augmentation,” arXiv preprint arXiv:2305.07804, 2023.
  393. A. Abaskohi, S. Rothe, and Y. Yaghoobzadeh, “Lm-cppf: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning,” arXiv preprint arXiv:2305.18169, 2023.
  394. S. Sarker, L. Qian, and X. Dong, “Medical data augmentation via chatgpt: A case study on medication identification and medication event classification,” arXiv preprint arXiv:2306.07297, 2023.
  395. H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, W. Liu, N. Liu et al., “Auggpt: Leveraging chatgpt for text data augmentation,” arXiv preprint arXiv:2302.13007, 2023.
  396. Y. Fang, X. Li, S. W. Thomas, and X. Zhu, “Chatgpt as data augmentation for compositional generalization: A case study in open intent detection,” arXiv preprint arXiv:2308.13517, 2023.
  397. C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
  398. B. Li, Y. Hou, and W. Che, “Data augmentation approaches in natural language processing: A survey,” Ai Open, vol. 3, pp. 71–90, 2022.
399. P. Liu, X. Wang, C. Xiang, and W. Meng, “A survey of text data augmentation,” in 2020 International Conference on Computer Communication and Network Security (CCNS). IEEE, 2020, pp. 191–195.
  400. S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, “A survey of data augmentation approaches for nlp,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 968–988.
  401. M. Bayer, M.-A. Kaufhold, and C. Reuter, “A survey on data augmentation for text classification,” ACM Computing Surveys, vol. 55, no. 7, pp. 1–39, 2022.
  402. Y. Belinkov and Y. Bisk, “Synthetic and natural noise both break neural machine translation,” in International Conference on Learning Representations, 2018.
  403. C. Coulombe, “Text data augmentation made simple by leveraging nlp cloud apis,” arXiv preprint arXiv:1812.04718, 2018.
  404. J. Wei and K. Zou, “Eda: Easy data augmentation techniques for boosting performance on text classification tasks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388.
405. W. Y. Wang and D. Yang, “That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2557–2563.
  406. R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 86–96.
  407. C. Mallikarjuna and S. Sivanesan, “Question classification using limited labelled data,” Information Processing & Management, vol. 59, no. 6, p. 103094, 2022.
  408. H. Zhan, Z. Li, Y. Wang, L. Luo, T. Feng, X. Kang, Y. Hua, L. Qu, L.-K. Soon, S. Sharma et al., “Socialdial: A benchmark for socially-aware dialogue systems,” arXiv preprint arXiv:2304.12026, 2023.
  409. J. Wang, Z. Yao, A. Mitra, S. Osebe, Z. Yang, and H. Yu, “Umass_bionlp at mediqa-chat 2023: Can llms generate high-quality synthetic note-oriented doctor-patient conversations?” arXiv preprint arXiv:2306.16931, 2023.
  410. S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. C. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y.-F. Li, “Textbooks are all you need,” ArXiv, vol. abs/2306.11644, 2023.
  411. C. Whitehouse, M. Choudhury, and A. F. Aji, “Llm-powered data augmentation for enhanced crosslingual performance,” ArXiv, vol. abs/2305.14288, 2023.
  412. T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar, “Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3309–3326.
413. T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 12, 2023, pp. 15009–15018.
  414. Z. Guo, P. Wang, Y. Wang, and S. Yu, “Dr. llama: Improving small language models on pubmedqa via generative data augmentation,” ArXiv, vol. abs/2305.07804, 2023.
  415. R. Eldan and Y. Li, “Tinystories: How small can language models be and still speak coherent english?” arXiv preprint arXiv:2305.07759, 2023.
  416. H. Liu, Z. Teng, L. Cui, C. Zhang, Q. Zhou, and Y. Zhang, “Logicot: Logical chain-of-thought instruction-tuning data collection with gpt-4,” arXiv preprint arXiv:2305.12147, 2023.
  417. B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” arXiv preprint arXiv:2304.03277, 2023.
  418. I. Malkiel, U. Alon, Y. Yehuda, S. Keren, O. Barkan, R. Ronen, and N. Koenigstein, “Gpt-calls: Enhancing call segmentation and tagging by generating synthetic conversations via large language models,” arXiv preprint arXiv:2306.07941, 2023.
  419. J. P. Wahle, T. Ruas, F. Kirstein, and B. Gipp, “How large language models are transforming machine-paraphrase plagiarism,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 952–963.
  420. A. Michail, S. Konstantinou, and S. Clematide, “Uzh_clyp at semeval-2023 task 9: Head-first fine-tuning and chatgpt data generation for cross-lingual learning in tweet intimacy prediction,” arXiv preprint arXiv:2303.01194, 2023.
  421. R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” arXiv preprint arXiv:2303.04360, 2023.
  422. Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. Ratner, R. Krishna, J. Shen, and C. Zhang, “Large language model as attributed training data generator: A tale of diversity and bias,” arXiv preprint arXiv:2306.15895, 2023.
  423. W. Yang and G. Nicolai, “Neural machine translation data generation and augmentation using chatgpt,” arXiv preprint arXiv:2307.05779, 2023.
  424. Y. Zhao, C. Zhao, L. Nan, Z. Qi, W. Zhang, X. Tang, B. Mi, and D. Radev, “Robut: A systematic study of table qa robustness against human-annotated adversarial perturbations,” arXiv preprint arXiv:2306.14321, 2023.
  425. W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Y. Wang, and L. Li, “Instructscore: Towards explainable text generation evaluation with automatic feedback,” arXiv preprint arXiv:2305.14282, 2023.
  426. A. Sugiyama and N. Yoshinaga, “Data augmentation using back-translation for context-aware neural machine translation,” in Proceedings of the fourth workshop on discourse in machine translation (DiscoMT 2019), 2019, pp. 35–44.
  427. F. Mireshghallah, J. Mattern, S. Gao, R. Shokri, and T. Berg-Kirkpatrick, “Smaller language models are better black-box machine-generated text detectors,” ArXiv, vol. abs/2305.09859, 2023.
  428. B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, “How close is chatgpt to human experts? comparison corpus, evaluation, and detection,” ArXiv, vol. abs/2301.07597, 2023.
  429. P. Hacker, A. Engel, and M. Mauer, “Regulating chatgpt and other large generative ai models,” in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023, pp. 1112–1123.
  430. L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E. Tozzi, and C. Rizzo, “Chatgpt and the rise of large language models: the new ai-driven infodemic threat in public health,” Frontiers in Public Health, vol. 11, p. 1166120, 2023.
431. S. Mitrović, D. Andreoletti, and O. Ayoub, “Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text,” ArXiv, vol. abs/2301.13852, 2023.
  432. C. A. Gao, F. M. Howard, N. S. Markov, E. C. Dyer, S. Ramesh, Y. Luo, and A. T. Pearson, “Comparing scientific abstracts generated by chatgpt to real abstracts with detectors and blinded human reviewers,” NPJ Digital Medicine, vol. 6, no. 1, p. 75, 2023.
  433. D. R. Cotton, P. A. Cotton, and J. R. Shipway, “Chatting and cheating: Ensuring academic integrity in the era of chatgpt,” Innovations in Education and Teaching International, pp. 1–12, 2023.
  434. P. C. Theocharopoulos, P. Anagnostou, A. Tsoukala, S. V. Georgakopoulos, S. K. Tasoulis, and V. P. Plagianakos, “Detection of fake generated scientific abstracts,” arXiv preprint arXiv:2304.06148, 2023.
  435. W. Zaitsu and M. Jin, “Distinguishing chatgpt (-3.5,-4)-generated and human-written papers through japanese stylometric analysis,” arXiv preprint arXiv:2304.05534, 2023.
  436. P. Yu, J. Chen, X. Feng, and Z. Xia, “Cheat: A large-scale dataset for detecting chatgpt-written abstracts,” arXiv preprint arXiv:2304.12008, 2023.
  437. X. Yang, W. Cheng, L. Petzold, W. Y. Wang, and H. Chen, “Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text,” arXiv preprint arXiv:2305.17359, 2023.
  438. Y. Liu, Z. Zhang, W. Zhang, S. Yue, X. Zhao, X. Cheng, Y. Zhang, and H. Hu, “Argugpt: evaluating, understanding and identifying argumentative essays generated by gpt models,” arXiv preprint arXiv:2304.07666, 2023.
  439. M. S. Orenstrakh, O. Karnalim, C. A. Suarez, and M. Liut, “Detecting llm-generated text in computing education: A comparative study for chatgpt cases,” arXiv preprint arXiv:2307.07411, 2023.
  440. W. Liao, Z. Liu, H. Dai, S. Xu, Z. Wu, Y. Zhang, X. Huang, D. Zhu, H. Cai, T. Liu et al., “Differentiate chatgpt-generated and human-written medical texts,” arXiv preprint arXiv:2304.11567, 2023.
  441. H. Zhan, X. He, Q. Xu, Y. Wu, and P. Stenetorp, “G3detector: General gpt-generated text detector,” arXiv preprint arXiv:2305.12680, 2023.
442. E. Clark, T. August, S. Serrano, N. Haduong, S. Gururangan, and N. A. Smith, “All that’s ‘human’ is not gold: Evaluating human evaluation of generated text,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7282–7296.
  443. A. Pegoraro, K. Kumari, H. Fereidooni, and A.-R. Sadeghi, “To chatgpt, or not to chatgpt: That is the question!” arXiv preprint arXiv:2304.01487, 2023.
  444. Z. Shi, Y. Wang, F. Yin, X. Chen, K.-W. Chang, and C.-J. Hsieh, “Red teaming language model detectors with language models,” arXiv preprint arXiv:2305.19713, 2023.
  445. M. Khalil and E. Er, “Will chatgpt get you caught? rethinking of plagiarism detection,” arXiv preprint arXiv:2302.04335, 2023.
  446. X. He, X. Shen, Z. Chen, M. Backes, and Y. Zhang, “Mgtbench: Benchmarking machine-generated text detection,” arXiv preprint arXiv:2303.14822, 2023.
  447. H. Wang, X. Luo, W. Wang, and X. Yan, “Bot or human? detecting chatgpt imposters with a single question,” ArXiv, vol. abs/2305.06424, 2023.
  448. Y. Chen, H. Kang, V. Zhai, L. Li, R. Singh, and B. Ramakrishnan, “Gpt-sentinel: Distinguishing human and chatgpt generated content,” ArXiv, vol. abs/2305.07969, 2023.
  449. X. Yu, Y. Qi, K. Chen, G. Chen, X. Yang, P. Zhu, W. Zhang, and N. H. Yu, “Gpt paternity test: Gpt generated text detection with gpt genetic inheritance,” ArXiv, vol. abs/2305.12519, 2023.
  450. L. Yang, F. Jiang, and H. Li, “Is chatgpt involved in texts? measure the polish ratio to detect chatgpt-generated text,” ArXiv, vol. abs/2307.11380, 2023.
  451. K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense,” arXiv preprint arXiv:2303.13408, 2023.
  452. D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck, “Automatic detection of generated text is easiest when humans are fooled,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1808–1822.
  453. S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
  454. X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How robust is gpt-3.5 to predecessors? a comprehensive study on language understanding tasks,” arXiv preprint arXiv:2303.00293, 2023.
  455. J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, X. Geng et al., “On the robustness of chatgpt: An adversarial and out-of-distribution perspective,” arXiv preprint arXiv:2302.12095, 2023.
  456. T. Y. Zhuo, Z. Li, Y. Huang, Y.-F. Li, W. Wang, G. Haffari, and F. Shiri, “On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex,” arXiv preprint arXiv:2301.12868, 2023.
  457. K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, N. Z. Gong, Y. Zhang et al., “Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,” arXiv preprint arXiv:2306.04528, 2023.
  458. A. Shirafuji, Y. Watanobe, T. Ito, M. Morishita, Y. Nakamura, Y. Oda, and J. Suzuki, “Exploring the robustness of large language models for solving programming problems,” arXiv preprint arXiv:2306.14583, 2023.
  459. R. Han, T. Peng, C. Yang, B. Wang, L. Liu, and X. Wan, “Is information extraction solved by chatgpt? an analysis of performance, evaluation criteria, robustness and errors,” arXiv preprint arXiv:2305.14450, 2023.
  460. H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang, “Evaluating the logical reasoning ability of chatgpt and gpt-4,” arXiv preprint arXiv:2304.03439, 2023.
  461. A. Liu, X. Hu, L. Wen, and P. S. Yu, “A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability,” arXiv preprint arXiv:2303.13547, 2023.
  462. E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn, “Detectgpt: Zero-shot machine-generated text detection using probability curvature,” arXiv preprint arXiv:2301.11305, 2023.
  463. S. Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, “A survey of adversarial defences and robustness in nlp,” ACM Computing Surveys, 2022.
  464. S. Qiu, Q. Liu, S. Zhou, and W. Huang, “Adversarial attack and defense technologies in natural language processing: A survey,” Neurocomputing, vol. 492, pp. 278–307, 2022.
  465. Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,” arXiv preprint arXiv:2108.13624, 2021.
  466. X. Wang, Q. Liu, T. Gui, Q. Zhang, Y. Zou, X. Zhou, J. Ye, Y. Zhang, R. Zheng, Z. Pang et al., “Textflint: Unified multilingual robustness evaluation toolkit for natural language processing,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, 2021, pp. 347–355.
  467. Y. Chen, R. Wang, H. Jiang, S. Shi, and R. Xu, “Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study,” arXiv preprint arXiv:2304.00723, 2023.
  468. A. B. Sai, A. K. Mohankumar, and M. M. Khapra, “A survey of evaluation metrics used for nlg systems,” ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–39, 2022.
  469. T. Y. Zhuo, “Large language models are state-of-the-art evaluators of code generation,” arXiv preprint arXiv:2304.14317, 2023.
  470. H. Lai, A. Toral, and M. Nissim, “Multidimensional evaluation for text style transfer using chatgpt,” arXiv preprint arXiv:2304.13462, 2023.
  471. Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “Gpteval: Nlg evaluation using gpt-4 with better human alignment,” arXiv preprint arXiv:2303.16634, 2023.
  472. T. Kocmi and C. Federmann, “Large language models are state-of-the-art evaluators of translation quality,” arXiv preprint arXiv:2302.14520, 2023.
  473. Q. Lu, B. Qiu, L. Ding, L. Xie, and D. Tao, “Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt,” arXiv preprint arXiv:2303.13809, 2023.
  474. Z. Luo, Q. Xie, and S. Ananiadou, “Chatgpt as a factual inconsistency evaluator for text summarization,” 2023.
  475. C. Shen, L. Cheng, Y. You, and L. Bing, “Are large language models good evaluators for abstractive summarization?” arXiv preprint arXiv:2305.13091, 2023.
  476. J. Fu, S.-K. Ng, Z. Jiang, and P. Liu, “Gptscore: Evaluate as you desire,” arXiv preprint arXiv:2302.04166, 2023.
  477. Y. Liu, A. R. Fabbri, P. Liu, D. Radev, and A. Cohan, “On learning to summarize with large language models as references,” arXiv preprint arXiv:2305.14239, 2023.
  478. M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan, “Human-like summarization evaluation with chatgpt,” arXiv preprint arXiv:2304.02554, 2023.
  479. T. Tang, H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. Zhao, and F. Wei, “Not all metrics are guilty: Improving nlg evaluation with llm paraphrasing,” arXiv preprint arXiv:2305.15067, 2023.
  480. P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui, “Large language models are not fair evaluators,” arXiv preprint arXiv:2305.17926, 2023.
  481. S. Jain, V. Keshava, S. M. Sathyendra, P. Fernandes, P. Liu, G. Neubig, and C. Zhou, “Multi-dimensional evaluation of text summarization with in-context learning,” arXiv preprint arXiv:2306.01200, 2023.
  482. J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou, “Is chatgpt a good nlg evaluator? a preliminary study,” arXiv preprint arXiv:2303.04048, 2023.
  483. Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu, K. Zeng, Y. Xiao, H. Lyu et al., “Benchmarking foundation models with language-model-as-an-examiner,” arXiv preprint arXiv:2306.04181, 2023.
  484. W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans: Augmenting large language models with multilingual translation capability over 100 languages,” arXiv preprint arXiv:2305.18098, 2023.
  485. L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” arXiv preprint arXiv:2306.05685, 2023.
  486. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  487. C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
  488. S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
  489. T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, and A. Menezes, “To ship or not to ship: An extensive evaluation of automatic metrics for machine translation,” in Proceedings of the Sixth Conference on Machine Translation, 2021, pp. 478–494.
  490. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” in International Conference on Learning Representations, 2019.
  491. W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger, “Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 563–578.
  492. W. Yuan, G. Neubig, and P. Liu, “Bartscore: Evaluating generated text as text generation,” Advances in Neural Information Processing Systems, vol. 34, pp. 27263–27277, 2021.
  493. S. Zhou, U. Alon, S. Agarwal, and G. Neubig, “Codebertscore: Evaluating code generation with pretrained models of code,” arXiv preprint arXiv:2302.05527, 2023.
  494. J. He, W. Kryściński, B. McCann, N. Rajani, and C. Xiong, “Ctrlsum: Towards generic controllable text summarization,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5879–5915.
  495. C. Shen, L. Cheng, L. Bing, Y. You, and L. Si, “Sentbs: Sentence-level beam search for controllable summarization,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10256–10265.
  496. Y. Liu, P. Liu, D. Radev, and G. Neubig, “Brio: Bringing order to abstractive summarization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2890–2903.
  497. Q. Lu, L. Ding, L. Xie, K. Zhang, D. F. Wong, and D. Tao, “Toward human-like evaluation for natural language generation with error analysis,” arXiv preprint arXiv:2212.10179, 2022.
  498. R. Bhardwaj and S. Poria, “Red-teaming large language models using chain of utterances for safety-alignment,” arXiv preprint arXiv:2308.09662, 2023.
  499. D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022.
  500. N. Mehrabi, P. Goyal, C. Dupuy, Q. Hu, S. Ghosh, R. Zemel, K.-W. Chang, A. Galstyan, and R. Gupta, “Flirt: Feedback loop in-context red teaming,” arXiv preprint arXiv:2308.04265, 2023.
  501. E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3419–3448.
  502. L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023.
  503. Z. Cheng, J. Kasai, and T. Yu, “Batch prompting: Efficient inference with large language model apis,” arXiv preprint arXiv:2301.08721, 2023.
  504. Y. Li, “Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering,” arXiv preprint arXiv:2304.12102, 2023.
  505. M. A. Arefeen, B. Debnath, and S. Chakradhar, “Leancontext: Cost-efficient domain-specific question answering using llms,” arXiv preprint arXiv:2309.00841, 2023.
  506. S. Golchin and M. Surdeanu, “Time travel in llms: Tracing data contamination in large language models,” arXiv preprint arXiv:2308.08493, 2023.
  507. R. Aiyappa, J. An, H. Kwak, and Y.-Y. Ahn, “Can we trust the evaluation on chatgpt?” arXiv preprint arXiv:2303.12767, 2023.
  508. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations, 2018.
  509. X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, vol. 28, 2015.
  510. S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1797–1807.
  511. Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
  512. V. Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.
  513. S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” 2023.
  514. L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” arXiv preprint arXiv:2307.15343, 2023.
  515. J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” arXiv preprint, 2023.
  516. B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen et al., “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” arXiv preprint arXiv:2302.12813, 2023.