A Comprehensive Overview of Large Language Models (2307.06435v10)

Published 12 Jul 2023 in cs.CL

Abstract: LLMs have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from extensive informative summaries of the existing works to advance the LLM research.

Overview of "A Comprehensive Overview of LLMs"

The paper "A Comprehensive Overview of LLMs" offers an extensive survey of developments and research directions within the domain of LLMs. The authors provide a thorough examination of the advancements, challenges, and methodologies associated with LLMs, presenting insights that can aid researchers and practitioners in understanding the current landscape and future potential of these models.

Architectures and Training Strategies

The paper meticulously reviews the architectural choices and training strategies that have been instrumental in the evolution of LLMs. The discussion highlights the differences between encoder-decoder, decoder-only, and encoder-only architectures, emphasizing their suitability for tasks such as natural language understanding (NLU), natural language generation (NLG), and sequence-to-sequence modeling. The encoder-decoder architecture is noted in particular for its versatility in switching between understanding and generation modes and for its effectiveness across a range of settings.
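To make the architectural distinction concrete, the following minimal sketch (not taken from the paper; the tensor sizes and names are illustrative) shows how a causal mask restricts a decoder-only block to past tokens, while an encoder-only block attends bidirectionally; an encoder-decoder model combines both, adding cross-attention from the decoder to the encoder outputs.

```python
# Minimal single-head attention; illustrative only, not the paper's code.
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Scaled dot-product attention over one head.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, dim = 4, 8
x = torch.randn(seq_len, dim)

# Decoder-only (causal): each token attends only to itself and earlier tokens.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
decoder_out = attention(x, x, x, causal_mask)

# Encoder-only (bidirectional): every token attends to the full sequence.
encoder_out = attention(x, x, x)

# An encoder-decoder model pairs a bidirectional encoder with a causal
# decoder that additionally cross-attends to the encoder outputs.
```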

Pre-trained and Instruction-tuned Models

A significant portion of the paper is dedicated to key pre-trained models, including T5, GPT-3, mT5, and others. These models have demonstrated remarkable zero-shot and few-shot capabilities, often surpassing traditional models on comprehensive benchmarks. Instruction tuning, illustrated through models like T0 and Flan, is emphasized as a crucial step in enhancing zero-shot performance and generalization to unseen tasks, with chain-of-thought (CoT) training unlocking multi-step reasoning capabilities.
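As an illustration of the prompting styles these instruction-tuned models respond to, here is a hedged sketch contrasting a plain zero-shot query with a chain-of-thought exemplar; the prompts and the arithmetic example are invented for illustration and are not drawn from the paper.

```python
# Illustrative prompt strings; any instruction-tuned LLM API could consume them.
zero_shot = (
    "Q: A farmer has 3 fields with 12 rows of 8 plants each. "
    "How many plants in total? A:"
)

chain_of_thought = (
    "Q: A farmer has 3 fields with 12 rows of 8 plants each. "
    "How many plants in total?\n"
    "A: Let's think step by step. Each field has 12 * 8 = 96 plants. "
    "Across 3 fields that is 3 * 96 = 288 plants. The answer is 288."
)
# A model prompted (or fine-tuned) in the CoT style tends to emit similar
# intermediate reasoning before its final answer, which the survey credits
# for improved multi-step reasoning.
```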

Insights into Fine-Tuning and Alignment

The exploration of fine-tuning strategies, including parameter-efficient methods such as adapter tuning, LoRA, and prompt tuning, offers valuable insights into adapting models without full retraining. Alignment with human preferences is also discussed, focusing on techniques such as reinforcement learning from human feedback (RLHF) and semi-automated alignment, which are pivotal in steering LLMs toward ethical and aligned outputs.
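To illustrate the idea behind LoRA-style parameter-efficient fine-tuning, below is a minimal sketch in which the pretrained weight is frozen and only a low-rank correction is trained; the rank, scaling, and layer sizes are illustrative assumptions, not the paper's or the original LoRA implementation.

```python
# Minimal LoRA-style adapter around a frozen linear layer (illustrative).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        # Low-rank factors: only r * (in + out) parameters are trained.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen projection plus the trainable low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))              # only A and B receive gradients
```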

Efficient Utilization and Multimodal Integration

The authors address the challenges of deploying LLMs, presenting efficiency techniques such as quantization and pruning alongside approaches to multimodal integration. With advancements in efficient attention mechanisms and parameter-efficient fine-tuning, models can be adapted for real-world applications without substantial computational overhead.
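As a toy illustration of one such efficiency technique, the sketch below performs per-tensor symmetric int8 post-training quantization of a weight matrix; this is a deliberate simplification of the production schemes the survey covers, and the matrix size is arbitrary.

```python
# Toy per-tensor symmetric int8 weight quantization (illustrative only).
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0             # map the largest weight to +/-127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)                   # ~4x smaller than float32 storage
error = (dequantize(q, scale) - w).abs().mean().item()
print(f"mean absolute quantization error: {error:.6f}")
```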

Evaluation and Datasets

A comprehensive look into evaluation datasets and methodologies provides a foundation for assessing the performance of LLMs in tasks ranging from language understanding to mathematical reasoning. The paper also discusses the construction and significance of diverse training datasets that cater to specific linguistic and domain-specific needs.
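For a sense of how such benchmarks are typically scored, here is a hedged sketch of an exact-match accuracy loop; the tiny dataset and the model_answer stub are placeholders standing in for a real evaluation set and an LLM call.

```python
# Exact-match scoring sketch; dataset and model call are placeholders.
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def model_answer(question: str) -> str:      # stand-in for an LLM call
    return {"What is 2 + 2?": "4", "Capital of France?": "paris"}[question]

accuracy = sum(
    exact_match(model_answer(ex["question"]), ex["answer"]) for ex in examples
) / len(examples)
print(f"exact-match accuracy: {accuracy:.2f}")
```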

Challenges and Future Directions

The paper identifies several challenges facing LLMs, including computational costs, biases, overfitting, and interpretability. It calls for further exploration into areas like model scalability, adversarial robustness, and multi-modality. The authors propose enhancing the framework of LLMs through refined control mechanisms and prompt engineering to mitigate issues like hallucination and bias.

Conclusion and Implications

By offering a seasoned perspective on the progress and hurdles in LLM research, the paper serves as a comprehensive guide for developing improved models. It underscores the transformative potential of LLMs across domains, encouraging ongoing research to address existing challenges and leverage the strengths of these models for broader applications.

The authors’ structured approach and detailed exploration of LLMs provide an invaluable resource, aiding both seasoned researchers and newcomers in navigating the complexities and opportunities within this rapidly evolving field.

  277. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” arXiv preprint arXiv:2206.04615, 2022.
  278. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.
  279. Y. Yao, Q. Dong, J. Guan, B. Cao, Z. Zhang, C. Xiao, X. Wang, F. Qi, J. Bao, J. Nie et al., “Cuge: A chinese language understanding and generation evaluation benchmark,” arXiv preprint arXiv:2112.13610, 2021.
  280. L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu et al., “Clue: A chinese language understanding evaluation benchmark,” arXiv preprint arXiv:2004.05986, 2020.
  281. L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan, X. Tian, L. Qin et al., “Fewclue: A chinese few-shot learning evaluation benchmark,” arXiv preprint arXiv:2107.07498, 2021.
  282. E. M. Smith, M. Williamson, K. Shuster, J. Weston, and Y.-L. Boureau, “Can you put it all together: Evaluating conversational agents’ ability to blend skills,” arXiv preprint arXiv:2004.08449, 2020.
  283. P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar et al., “Holistic evaluation of language models,” arXiv preprint arXiv:2211.09110, 2022.
  284. S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh et al., “Klue: Korean language understanding evaluation,” arXiv preprint arXiv:2105.09680, 2021.
  285. S. Reddy, D. Chen, and C. D. Manning, “Coqa: A conversational question answering challenge,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 249–266, 2019.
  286. M. T. Pilehvar and J. Camacho-Collados, “Wic: 10,000 example pairs for evaluating context-sensitive representations,” arXiv preprint arXiv:1808.09121, vol. 6, 2018.
  287. S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016.
  288. J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Compressive transformers for long-range sequence modelling,” arXiv preprint arXiv:1911.05507, 2019.
  289. X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, and B. Tang, “Lcqmc: A large-scale chinese question matching corpus,” in Proceedings of the 27th international conference on computational linguistics, 2018, pp. 1952–1962.
  290. S. Iyer, N. Dandekar, and K. Csernai, “First quora dataset release: Question pairs,” https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs.
  291. R. Rudinger, J. Naradowsky, B. Leonard, and B. Van Durme, “Gender bias in coreference resolution,” arXiv preprint arXiv:1804.09301, 2018.
  292. M.-C. De Marneffe, M. Simons, and J. Tonhauser, “The commitmentbank: Investigating projection in naturally occurring discourse,” in proceedings of Sinn und Bedeutung, vol. 23, no. 2, 2019, pp. 107–124.
  293. Z. Li, N. Ding, Z. Liu, H. Zheng, and Y. Shen, “Chinese relation extraction with multi-grained information and external linguistic knowledge,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4377–4386.
  294. J. Xu, J. Wen, X. Sun, and Q. Su, “A discourse-level named entity recognition and relation extraction dataset for chinese literature text,” arXiv preprint arXiv:1711.07010, 2017.
  295. J. Chen, Q. Chen, X. Liu, H. Yang, D. Lu, and B. Tang, “The bq corpus: A large-scale domain-specific chinese corpus for sentence semantic equivalence identification,” in Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 4946–4951.
  296. B. Liu, D. Niu, H. Wei, J. Lin, Y. He, K. Lai, and Y. Xu, “Matching article pairs with graphical decomposition and convolutions,” arXiv preprint arXiv:1802.07459, 2018.
  297. P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, and W. Xu, “Dataset and neural recurrent sequence labeling model for open-domain factoid question answering,” arXiv preprint arXiv:1607.06275, 2016.
  298. N. Peng and M. Dredze, “Named entity recognition for chinese social media with jointly trained embeddings,” in Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 548–554.
  299. W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Program induction by rationale generation: Learning to solve and explain algebraic word problems,” arXiv preprint arXiv:1705.04146, 2017.
  300. R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin et al., “Ontonotes release 4.0,” LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium, 2011.
  301. D. Vilares and C. Gómez-Rodríguez, “Head-qa: A healthcare dataset for complex reasoning,” arXiv preprint arXiv:1906.04701, 2019.
  302. S. L. Blodgett, L. Green, and B. O’Connor, “Demographic dialectal variation in social media: A case study of african-american english,” arXiv preprint arXiv:1608.08868, 2016.
  303. N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen, “A corpus and evaluation framework for deeper understanding of commonsense stories,” arXiv preprint arXiv:1604.01696, 2016.
  304. D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández, “The lambada dataset: Word prediction requiring a broad discourse context,” arXiv preprint arXiv:1606.06031, 2016.
  305. B. Hu, Q. Chen, and F. Zhu, “Lcsts: A large scale chinese short text summarization dataset,” arXiv preprint arXiv:1506.05865, 2015.
  306. Z. Shao, M. Huang, J. Wen, W. Xu, and X. Zhu, “Long and diverse text generation with planning-based hierarchical variational model,” arXiv preprint arXiv:1908.06605, 2019.
  307. J. Novikova, O. Dušek, and V. Rieser, “The e2e dataset: New challenges for end-to-end generation,” arXiv preprint arXiv:1706.09254, 2017.
  308. C. Zheng, M. Huang, and A. Sun, “Chid: A large-scale chinese idiom dataset for cloze test,” arXiv preprint arXiv:1906.01265, 2019.
  309. Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439.
  310. M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551, 2017.
  311. P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
  312. S. Aroca-Ouellette, C. Paik, A. Roncone, and K. Kann, “Prost: Physical reasoning of objects through space and time,” arXiv preprint arXiv:2106.03634, 2021.
  313. T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? a new dataset for open book question answering,” arXiv preprint arXiv:1809.02789, 2018.
  314. T. C. Ferreira, C. Gardent, N. Ilinykh, C. Van Der Lee, S. Mille, D. Moussallem, and A. Shimorina, “The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020),” in Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), 2020.
  315. C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, and F. Wei, “Blow the dog whistle: A chinese dataset for cant understanding with common sense and world knowledge,” arXiv preprint arXiv:2104.02704, 2021.
  316. G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “Race: Large-scale reading comprehension dataset from examinations,” arXiv preprint arXiv:1704.04683, 2017.
  317. E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, and L. Zettlemoyer, “Quac: Question answering in context,” arXiv preprint arXiv:1808.07036, 2018.
  318. M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant, “Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 346–361, 2021.
  319. J. Boyd-Graber, B. Satinoff, H. He, and H. Daumé III, “Besting the quiz master: Crowdsourcing incremental classification games,” in Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, 2012, pp. 1290–1301.
  320. S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, and Z. Ding, “Chinese medical question answer matching using end-to-end character-level multi-scale cnns,” Applied Sciences, vol. 7, no. 8, p. 767, 2017.
  321. S. Zhang, X. Zhang, H. Wang, L. Guo, and S. Liu, “Multi-scale attentive interaction networks for chinese medical question answer selection,” IEEE Access, vol. 6, pp. 74 061–74 071, 2018.
  322. C. Xu, J. Pei, H. Wu, Y. Liu, and C. Li, “Matinf: A jointly labeled large-scale dataset for classification, question answering and summarization,” arXiv preprint arXiv:2004.12302, 2020.
  323. K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021.
  324. R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” arXiv preprint arXiv:1905.07830, 2019.
  325. M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning.” in AAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95.
  326. H. Levesque, E. Davis, and L. Morgenstern, “The winograd schema challenge,” in Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  327. A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” arXiv preprint arXiv:1811.00937, 2018.
  328. M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi, “Socialiqa: Commonsense reasoning about social interactions,” arXiv preprint arXiv:1904.09728, 2019.
  329. K. Sun, D. Yu, D. Yu, and C. Cardie, “Investigating prior knowledge for challenging chinese machine reading comprehension,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 141–155, 2020.
  330. S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme, “Record: Bridging the gap between human and machine commonsense reading comprehension,” arXiv preprint arXiv:1810.12885, 2018.
  331. P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” arXiv preprint arXiv:1606.05250, 2016.
  332. C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” arXiv preprint arXiv:1905.10044, 2019.
  333. P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for squad,” arXiv preprint arXiv:1806.03822, 2018.
  334. D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, “Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs,” arXiv preprint arXiv:1903.00161, 2019.
  335. I. Dagan, O. Glickman, and B. Magnini, “The pascal recognising textual entailment challenge,” in Machine learning challenges workshop.   Springer, 2005, pp. 177–190.
  336. Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, and Y. Bisk, “Webqa: Multihop and multimodal qa,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 495–16 504.
  337. Y. Cui, T. Liu, Z. Chen, W. Ma, S. Wang, and G. Hu, “Dataset for the first evaluation on chinese machine reading comprehension,” arXiv preprint arXiv:1709.08299, 2017.
  338. Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, and G. Hu, “A span-extraction dataset for chinese machine reading comprehension,” arXiv preprint arXiv:1810.07366, 2018.
  339. Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, and G. Hu, “A sentence cloze dataset for chinese machine reading comprehension,” arXiv preprint arXiv:2004.03116, 2020.
  340. Y. Li, T. Liu, D. Li, Q. Li, J. Shi, and Y. Wang, “Character-based bilstm-crf incorporating pos and dictionaries for chinese opinion target extraction,” in Asian Conference on Machine Learning.   PMLR, 2018, pp. 518–533.
  341. D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth, “Looking beyond the surface: A challenge set for reading comprehension over multiple sentences,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 252–262.
  342. T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
  343. C. C. Shao, T. Liu, Y. Lai, Y. Tseng, and S. Tsai, “Drcd: A chinese machine reading comprehension dataset,” arXiv preprint arXiv:1806.00920, 2018.
  344. W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She et al., “Dureader: a chinese machine reading comprehension dataset from real-world applications,” arXiv preprint arXiv:1711.05073, 2017.
  345. H. Tang, J. Liu, H. Li, Y. Hong, H. Wu, and H. Wang, “Dureaderrobust: A chinese dataset towards evaluating the robustness of machine reading comprehension models,” arXiv preprint arXiv:2004.11142, 2020.
  346. J. Welbl, N. F. Liu, and M. Gardner, “Crowdsourcing multiple choice science questions,” arXiv preprint arXiv:1707.06209, 2017.
  347. C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power, “End-to-end neural ad-hoc ranking with kernel pooling,” in Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, 2017, pp. 55–64.
  348. A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, and R. Morante, “Qa4mre 2011-2013: Overview of question answering for machine reading evaluation,” in Information Access Evaluation. Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings 4.   Springer, 2013, pp. 303–320.
  349. S. Lim, M. Kim, and J. Lee, “Korquad1. 0: Korean qa dataset for machine reading comprehension,” arXiv preprint arXiv:1909.07005, 2019.
  350. C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang et al., “Cail2018: A large-scale legal dataset for judgment prediction,” arXiv preprint arXiv:1807.02478, 2018.
  351. D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., “Measuring coding challenge competence with apps,” arXiv preprint arXiv:2105.09938, 2021.
  352. Y. Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” in Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 845–854.
  353. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
  354. J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021.
  355. F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou et al., “Language models are multilingual chain-of-thought reasoners,” arXiv preprint arXiv:2210.03057, 2022.
  356. S. Roy and D. Roth, “Solving general arithmetic word problems,” arXiv preprint arXiv:1608.01413, 2016.
  357. S.-Y. Miao, C.-C. Liang, and K.-Y. Su, “A diverse corpus for evaluating and developing english math word problem solvers,” arXiv preprint arXiv:2106.15772, 2021.
  358. R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, and H. Hajishirzi, “Mawps: A math word problem repository,” in Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 2016, pp. 1152–1157.
  359. A. Patel, S. Bhattamishra, and N. Goyal, “Are nlp models really able to solve simple math word problems?” arXiv preprint arXiv:2103.07191, 2021.
  360. Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, and T. Yu, “Ds-1000: A natural and reliable benchmark for data science code generation,” in International Conference on Machine Learning.   PMLR, 2023, pp. 18 319–18 345.
  361. J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021.
  362. Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela, “Adversarial nli: A new benchmark for natural language understanding,” arXiv preprint arXiv:1910.14599, 2019.
  363. A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” arXiv preprint arXiv:1704.05426, 2017.
  364. R. T. McCoy, E. Pavlick, and T. Linzen, “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference,” arXiv preprint arXiv:1902.01007, 2019.
  365. J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang, “Logiqa: A challenge dataset for machine reading comprehension with logical reasoning,” arXiv preprint arXiv:2007.08124, 2020.
  366. P. Lewis, B. Oğuz, R. Rinott, S. Riedel, and H. Schwenk, “Mlqa: Evaluating cross-lingual extractive question answering,” arXiv preprint arXiv:1910.07475, 2019.
  367. A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov, “Xnli: Evaluating cross-lingual sentence representations,” arXiv preprint arXiv:1809.05053, 2018.
  368. Y. Yang, Y. Zhang, C. Tar, and J. Baldridge, “Paws-x: A cross-lingual adversarial dataset for paraphrase identification,” arXiv preprint arXiv:1908.11828, 2019.
  369. S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary!” Topic-Aware Convolutional Neural Networks for Extreme Summarization. ArXiv, abs, 1808.
  370. E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, and A. Korhonen, “Xcopa: A multilingual dataset for causal commonsense reasoning,” arXiv preprint arXiv:2005.00333, 2020.
  371. A. Tikhonov and M. Ryabinin, “It’s all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning,” arXiv preprint arXiv:2106.12066, 2021.
  372. J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki, “Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 454–470, 2020.
  373. T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, and J. Staiano, “Mlsum: The multilingual summarization corpus,” arXiv preprint arXiv:2004.14900, 2020.
  374. S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” arXiv preprint arXiv:2109.07958, 2021.
  375. I. Augenstein, C. Lioma, D. Wang, L. C. Lima, C. Hansen, C. Hansen, and J. G. Simonsen, “Multifc: A real-world multi-domain dataset for evidence-based fact checking of claims,” arXiv preprint arXiv:1909.03242, 2019.
  376. J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal, “Fever: a large-scale dataset for fact extraction and verification,” arXiv preprint arXiv:1803.05355, 2018.
  377. I. Mollas, Z. Chrysopoulou, S. Karlos, and G. Tsoumakas, “Ethos: an online hate speech detection dataset,” arXiv preprint arXiv:2006.08328, 2020.
  378. M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotypical bias in pretrained language models,” arXiv preprint arXiv:2004.09456, 2020.
  379. A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “Bbq: A hand-built bias benchmark for question answering,” arXiv preprint arXiv:2110.08193, 2021.
  380. J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,” arXiv preprint arXiv:1804.06876, 2018.
  381. N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “Crows-pairs: A challenge dataset for measuring social biases in masked language models,” arXiv preprint arXiv:2010.00133, 2020.
  382. S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” arXiv preprint arXiv:2009.11462, 2020.
  383. D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman, “Nuanced metrics for measuring unintended bias with real data for text classification,” in Companion proceedings of the 2019 world wide web conference, 2019, pp. 491–500.
  384. O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz et al., “Findings of the 2016 conference on machine translation,” in Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp. 131–198.
  385. B. Loïc, B. Magdalena, B. Ondřej, F. Christian, G. Yvette, G. Roman, H. Barry, H. Matthias, J. Eric, K. Tom et al., “Findings of the 2020 conference on machine translation (wmt20),” in Proceedings of the Fifth Conference on Machine Translation.   Association for Computational Linguistics,, 2020, pp. 1–55.
  386. W. Li, F. Qi, M. Sun, X. Yi, and J. Zhang, “Ccpm: A chinese classical poetry matching dataset,” arXiv preprint arXiv:2106.01979, 2021.
  387. E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston, “Wizard of wikipedia: Knowledge-powered conversational agents,” arXiv preprint arXiv:1811.01241, 2018.
  388. H. Rashkin, E. M. Smith, M. Li, and Y.-L. Boureau, “Towards empathetic open-domain conversation models: A new benchmark and dataset,” arXiv preprint arXiv:1811.00207, 2018.
  389. E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe et al., “The second conversational intelligence challenge (convai2),” in The NeurIPS’18 Competition: From Machine Learning to Intelligent Conversations.   Springer, 2020, pp. 187–208.
  390. H. Zhou, C. Zheng, K. Huang, M. Huang, and X. Zhu, “Kdconv: A chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation,” arXiv preprint arXiv:2004.04100, 2020.
  391. L. CO, “Iflytek: a multiple categories chinese text classifier. competition official website,” 2019.
  392. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  393. J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The pushshift reddit dataset,” in Proceedings of the international AAAI conference on web and social media, vol. 14, 2020, pp. 830–839.
  394. A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli, “Eli5: Long form question answering,” arXiv preprint arXiv:1907.09190, 2019.
  395. Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap et al., “Benchmarking generalization via in-context instructions on 1,600+ language tasks,” arXiv preprint arXiv:2204.07705, 2022.
  396. T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang et al., “Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models,” arXiv preprint arXiv:2201.05966, 2022.
  397. Q. Ye, B. Y. Lin, and X. Ren, “Crossfit: A few-shot learning challenge for cross-task generalization in nlp,” arXiv preprint arXiv:2104.08835, 2021.
  398. V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta, H. Zhuang, V. Q. Tran, D. Bahri, J. Ni et al., “Ext5: Towards extreme multi-task scaling for transfer learning,” arXiv preprint arXiv:2111.10952, 2021.
  399. A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).   New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1112–1122. [Online]. Available: https://aclanthology.org/N18-1101
  400. Y. Zhang, J. Baldridge, and L. He, “PAWS: Paraphrase adversaries from word scrambling,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1298–1308. [Online]. Available: https://aclanthology.org/N19-1131
  401. T. Q. Nguyen and J. Salazar, “Transformers without tears: Improving the normalization of self-attention,” CoRR, vol. abs/1910.05895, 2019.
  402. E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in nlp,” arXiv preprint arXiv:1906.02243, 2019.
  403. E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, 2021, pp. 610–623.
  404. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning (still) requires rethinking generalization,” Communications of the ACM, vol. 64, no. 3, pp. 107–115, 2021.
  405. M. Tänzer, S. Ruder, and M. Rei, “Memorisation versus generalisation in pre-trained language models,” arXiv preprint arXiv:2105.00828, 2021.
  406. S. M. West, M. Whittaker, and K. Crawford, “Discriminating systems,” AI Now, pp. 1–33, 2019.
  407. K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati, “Large language models still can’t plan (a benchmark for llms on planning and reasoning about change),” arXiv preprint arXiv:2206.10498, 2022.
  408. Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
  409. A. Webson and E. Pavlick, “Do prompt-based models really understand the meaning of their prompts?” arXiv preprint arXiv:2109.01247, 2021.
  410. O. Shaikh, H. Zhang, W. Held, M. Bernstein, and D. Yang, “On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning,” arXiv preprint arXiv:2212.08061, 2022.
  411. X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao, “Adversarial training for large neural language models,” ArXiv, April 2020. [Online]. Available: https://www.microsoft.com/en-us/research/publication/adversarial-training-for-large-neural-language-models/
  412. E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh, “Survey of vulnerabilities in large language models revealed by adversarial attacks,” 2023.
  413. X. Xu, K. Kong, N. Liu, L. Cui, D. Wang, J. Zhang, and M. Kankanhalli, “An llm can fool itself: A prompt-based adversarial attack,” 2023.
  414. H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, and M. Du, “Explainability for large language models: A survey,” 2023.
  415. S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, and L. H. Gilpin, “Can large language models explain themselves? a study of llm-generated self-explanations,” 2023.
  416. H. Brown, K. Lee, F. Mireshghallah, R. Shokri, and F. Tramèr, “What does it mean for a language model to preserve privacy?” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 2280–2292.
  417. R. Plant, V. Giuffrida, and D. Gkatzia, “You are what you write: Preserving privacy in the era of large language models,” arXiv preprint arXiv:2204.09391, 2022.
  418. W. Niu, Z. Kong, G. Yuan, W. Jiang, J. Guan, C. Ding, P. Zhao, S. Liu, B. Ren, and Y. Wang, “Real-time execution of large-scale language models on mobile,” 2020.
  419. C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15.
  420. B. Meskó and E. J. Topol, “The imperative for regulatory oversight of large language models (or generative ai) in healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 120, 2023.
  421. J. Zhang, X. Ji, Z. Zhao, X. Hei, and K.-K. R. Choo, “Ethical considerations and policy implications for large language models: Guiding responsible development and deployment,” arXiv preprint arXiv:2308.02678, 2023.
  422. J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi, “Auditing large language models: a three-layered approach,” AI and Ethics, pp. 1–31, 2023.
Authors (9)
  1. Humza Naveed
  2. Asad Ullah Khan
  3. Shi Qiu
  4. Muhammad Saqib
  5. Saeed Anwar
  6. Muhammad Usman
  7. Naveed Akhtar
  8. Nick Barnes
  9. Ajmal Mian
Citations (322)