
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (2302.04023v4)

Published 8 Feb 2023 in cs.CL and cs.AI

Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release the codebase for evaluation set extraction.

Overview of Multitask, Multilingual, and Multimodal Evaluation of ChatGPT

The paper, titled "A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity," presents a systematic framework for evaluating the capabilities of ChatGPT. The evaluation leverages 23 datasets encompassing eight diverse NLP tasks, namely question answering, reasoning, summarization, machine translation, sentiment analysis, language identification, task-oriented dialogue, and misinformation detection. Recognizing the absence of previous benchmarking results for ChatGPT, this paper provides a comprehensive third-party evaluation to assess the multitask, multilingual, and multimodal aspects of this model, alongside its reasoning abilities and interactive features.

Evaluation Framework

The framework's hallmark is its breadth, testing many facets of ChatGPT's capabilities:

  • Multitask Evaluation: The evaluation spans multiple standard NLP tasks using datasets such as CNN/DM, SAMSum, FLoRes-200, bAbI, and EntailmentBank. ChatGPT's performance is compared against prior state-of-the-art (SOTA) models in both fine-tuned and zero-shot settings (a zero-shot evaluation loop is sketched after this list).
  • Multilingual Evaluation: ChatGPT's understanding and generation abilities are tested across high-resource and low-resource languages. Sentiment analysis, language identification, and machine translation are used to probe these linguistic capabilities.
  • Multimodal Evaluation: Text-to-image generation is probed by asking ChatGPT to produce images from textual descriptions via an intermediate code representation.
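
As a concrete illustration (not the authors' released evaluation codebase), a zero-shot evaluation loop might look like the following sketch. It assumes the `openai` Python client (v1 API), uses "gpt-3.5-turbo" as a stand-in for the ChatGPT model evaluated in the paper, and scores a toy two-example sentiment set:

```python
# Illustrative zero-shot evaluation loop, not the paper's released code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = ("Classify the sentiment of the text as 'positive' or "
               "'negative'. Answer with one word.")

EVAL_SET = [  # (input text, gold label); a toy stand-in for a real dataset
    ("The film was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]

def zero_shot_answer(text: str) -> str:
    """Single-turn, zero-shot query: instruction plus test input, no demonstrations."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{INSTRUCTION}\n\nText: {text}"}],
        temperature=0,  # greedy decoding for reproducible scoring
    )
    return response.choices[0].message.content.strip().lower()

correct = sum(zero_shot_answer(text) == label for text, label in EVAL_SET)
print(f"accuracy: {correct / len(EVAL_SET):.2f}")
```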

Numerical Results and Performance Insights

  1. Multitask Performance:
    • ChatGPT surpasses previous zero-shot LLMs on 9 out of 13 datasets and even exceeds certain fine-tuned task-specific models. However, limitations are noted in task-oriented and knowledge-grounded dialogue tasks.
    • It effectively handles summarization on datasets such as CNN/DM and SAMSum, although its performance varies relative to fine-tuned models like BART.
  2. Multilingual Capabilities:
    • ChatGPT performs well in high-resource languages such as French and Chinese but struggles with low-resource languages like Javanese and Buginese.
    • It displays better understanding than generation ability in non-Latin scripts, indicating a gap between comprehension and generation across diverse writing systems.
  3. Multimodal Abilities:
    • The flag drawing task demonstrates ChatGPT's potential to convert text into visual code (e.g., SVG), though it also reveals basic limitations in accurately depicting complex shapes and sizes (see the rendering sketch after this list).
    • Generation quality improves considerably with iterative refinement across multiple turns.
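
To make the intermediate-code idea concrete, the snippet below writes out SVG markup and leaves rendering to any SVG viewer. The markup is a hand-written stand-in for a model response (a rough vertical tricolour), not actual ChatGPT output:

```python
# "Text-to-image via code": the model is asked for SVG markup, which is then
# rendered outside the model. The markup here is an illustrative placeholder.
svg_markup = """\
<svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">
  <rect x="0"   y="0" width="100" height="200" fill="#0055A4"/>
  <rect x="100" y="0" width="100" height="200" fill="#FFFFFF"/>
  <rect x="200" y="0" width="100" height="200" fill="#EF4135"/>
</svg>
"""

with open("flag.svg", "w") as f:
    f.write(svg_markup)
# flag.svg now opens in any browser; tools such as cairosvg can rasterize it.
```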

Reasoning and Hallucination Evaluations

  1. Reasoning Skills:
    • Detailed evaluations span deductive, inductive, abductive, and commonsense reasoning, using datasets like αNLI, CommonsenseQA, and HotpotQA.
    • ChatGPT exhibits notably high performance in deductive reasoning but is less reliable in tasks demanding inductive and multi-hop reasoning (the headline average over reasoning categories is illustrated after this list).
  2. Hallucination and Factuality:
    • The common issue of hallucination is confirmed: ChatGPT produces extrinsic hallucinations, both factual and non-factual, in tasks like machine translation and summarization.
    • Evaluations with TruthfulQA reveal that ChatGPT sometimes replicates human falsehoods, emphasizing the need for improved factuality controls.
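
The abstract's 63.41% figure appears to be a macro-average of per-category accuracies over the ten reasoning categories. A minimal sketch of that aggregation, using placeholder numbers rather than the paper's reported per-category values:

```python
# Macro-average over reasoning categories: each category's accuracy counts
# equally, regardless of how many test items it contains. All numbers below
# are placeholders, not the paper's reported results.
per_category_accuracy = {
    "deductive": 0.90,    # placeholder
    "inductive": 0.55,    # placeholder
    "abductive": 0.62,    # placeholder
    "commonsense": 0.70,  # placeholder
}

macro_avg = sum(per_category_accuracy.values()) / len(per_category_accuracy)
print(f"macro-average accuracy: {macro_avg:.2%}")
```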

Interactivity and Iterative Improvements

ChatGPT's interactive abilities are a significant differentiator from its predecessors:

  • In summarization tasks, iterative prompts help refine outputs to be more concise, improving ROUGE scores (a multi-turn refinement loop is sketched after this list).
  • Machine translation benefits from multi-turn interactions, where post-editing helps correct and improve translations, especially for low-resource languages.
  • Multimodal interactions allow for iterative refinements in image generation, akin to human interaction in creative tasks.
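
A minimal sketch of such a multi-turn "prompt engineering" loop for summarization, assuming the `openai` Python client (v1 API) and the `rouge-score` package; the documents and the follow-up instruction are illustrative placeholders, not the paper's exact prompts:

```python
# Generate a summary, then refine it in the same conversation and measure
# the change in ROUGE-1 against a reference summary.
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # reads OPENAI_API_KEY from the environment
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def chat(messages: list[dict]) -> str:
    """One model turn over the full conversation so far."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return resp.choices[0].message.content

document = "<source article text>"   # placeholder input
reference = "<gold summary>"         # placeholder reference

# Turn 1: plain zero-shot summarization request.
messages = [{"role": "user",
             "content": f"Summarize the following article:\n\n{document}"}]
first = chat(messages)

# Turn 2: keep the conversation and ask for a tighter summary.
messages += [{"role": "assistant", "content": first},
             {"role": "user",
              "content": "Make the summary shorter and keep only the key facts."}]
refined = chat(messages)

for name, summary in [("first pass", first), ("refined", refined)]:
    r1 = scorer.score(reference, summary)["rouge1"].fmeasure
    print(f"{name}: ROUGE-1 F1 = {r1:.3f}")
```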

Implications and Future Directions

ChatGPT's evaluation highlights several practical and theoretical implications:

  • Practical: The model's utility in multilingual contexts and interactive applications is promising for real-world deployment, but support for low-resource languages and iterative correction strategies still needs refinement.
  • Theoretical: Future research should enhance deductive and inductive reasoning, refine RLHF (reinforcement learning from human feedback) for better factual accuracy, and develop robust mechanisms to mitigate hallucination.

The framework proposed by the paper sets a comprehensive benchmark for assessing the evolving capabilities of LLMs and can guide future developments in AI. These findings stress the criticality of diverse and iterative evaluation methods in understanding and advancing the performance of LLMs like ChatGPT.

References (152)
  1. One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7226–7249, Dublin, Ireland. Association for Computational Linguistics.
  2. Ömer Aydın and Enis Karaarslan. 2022. Openai chatgpt generated literature review: Digital twin in healthcare. Available at SSRN 4308687.
  3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  4. Paul Bartha. 2013. Analogy and analogical reasoning.
  5. Abductive commonsense reasoning. In International Conference on Learning Representations.
  6. Prajjwal Bhargava and Vincent Ng. 2022. Commonsense knowledge reasoning and generation with pre-trained language models: a survey. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36 (11), pages 12317–12325.
  7. Findings of the wmt 2022 shared task on automatic post-editing. In Proceedings of the Seventh Conference on Machine Translation, pages 109–117, Abu Dhabi.
  8. David G.W. Birch. 2022. Chatgpt is a window into the real future of financial services.
  9. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34 (05), pages 7432–7439.
  10. The role of ai in drug discovery: Challenges, opportunities, and strategies. arXiv preprint arXiv:2212.08104.
  11. Nusacrowd: Open source initiative for indonesian nlp resources.
  12. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  13. Ethan C. Chau and Noah A. Smith. 2021. Specializing multilingual language models: An empirical study. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 51–61, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  14. Chatgpt goes to law school. Available at SSRN.
  15. Palm: Scaling language modeling with pathways.
  16. Deep reinforcement learning from human preferences.
  17. Think you have solved question answering? try arc, the ai2 reasoning challenge.
  18. Cookup.ai. 2022. Chatgpt - where it lacks.
  19. Enabling multimodal generation on CLIP via vision-language knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2383–2395, Dublin, Ireland. Association for Computational Linguistics.
  20. Instructblip: Towards general-purpose vision-language models with instruction tuning.
  21. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2136–2148, Dubrovnik, Croatia. Association for Computational Linguistics.
  22. Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370.
  23. Ernest Davis. 2023a. Benchmarks for automated commonsense reasoning: A survey. arXiv preprint arXiv:2302.04752.
  24. Ernest Davis. 2023b. Mathematics, word problems, common sense, and artificial intelligence. arXiv preprint arXiv:2301.09723.
  25. Tech Desk. 2023a. Chatgpt vs satya nadella over biryani: The chatbot is learning from its mistakes.
  26. Web Desk. 2023b. Colombian judge uses chatgpt in ruling, triggers debate.
  27. Igor Douven. 2017. Abduction.
  28. Michael Dowling and Brian Lucey. 2023. Chatgpt for (finance) research: The bananarama conjecture. Finance Research Letters, page 103662.
  29. e-CARE: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics.
  30. Mathematical capabilities of chatgpt.
  31. A framework for few-shot language model evaluation.
  32. How well does chatgpt do when taking the medical licensing exams? the implications of large language models for medical education and knowledge assessment. medRxiv, pages 2022–12.
  33. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70.
  34. Yoav Goldberg. 2023. Some remarks on large language models.
  35. Cindy Gordon. 2023. Chatgpt is the fastest growing app in the history of web applications.
  36. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
  37. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.
  38. Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. 2023. Chatgpt is not all you need. a state of the art review of large generative ai models. arXiv preprint arXiv:2301.04655.
  39. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
  40. James Hawthorne. 2021. Inductive Logic. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, Spring 2021 edition. Metaphysics Research Lab, Stanford University.
  41. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  42. Training compute-optimal large language models.
  43. Krystal Hu. 2023. Chatgpt sets record for fastest-growing user base - analyst note.
  44. Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
  45. Aspect detection and sentiment classification using deep neural network for indonesian aspect-based sentiment analysis. In 2018 International Conference on Asian Language Processing (IALP), pages 62–67.
  46. Hadar Yoana Jabotinsky and Roee Sarel. 2022. Co-authoring with an ai? ethical dilemmas and artificial intelligence. Ethical Dilemmas and Artificial Intelligence (December 15, 2022).
  47. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. arXiv preprint arXiv:2212.14882.
  48. Survey of hallucination in natural language generation. ACM Comput. Surv. Just Accepted.
  49. Rho (ρ): Reducing hallucination in open-domain dialogues with knowledge grounding. arXiv preprint arXiv:2212.01588.
  50. Is chatgpt a good translator? a preliminary study.
  51. Arianna Johnson. 2023. Is chatgpt partisan? poems about trump and biden raise questions about the ai bot’s bias-here’s what experts think.
  52. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
  53. Jennifer A. Kingson. 2023. Friend or foe? teachers debate chatgpt.
  54. Chatgpt: Jack of all trades, master of none. Information Fusion, page 101861.
  55. Escape Velocity Labs. 2022. Chatgpt imitates logical reasoning surprisingly well.
  56. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
  57. A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets. In Findings of the Association for Computational Linguistics: ACL 2023, pages 431–469, Toronto, Canada. Association for Computational Linguistics.
  58. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets. arXiv preprint arXiv:2305.18486.
  59. Anton E Lawson. 2005. What is the role of induction and deduction in reasoning and scientific inquiry? Journal of Research in Science Teaching, 42(6):716–740.
  60. Towards few-shot fact-checking via perplexity. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1971–1981, Online. Association for Computational Linguistics.
  61. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems.
  62. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  63. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  64. Holistic evaluation of language models.
  65. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
  66. Zero-shot dialogue state tracking via cross-task transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7890–7900.
  67. Every picture tells a story: Image-grounded controllable stylistic story generation. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 40–52, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
  68. Trip: Triangular document-level pre-training for multilingual language models. arXiv preprint arXiv:2212.07752.
  69. Few-shot bot: Prompt-based learning for dialogue systems. arXiv preprint arXiv:2110.08118.
  70. Dissociating language and thought in large language models: a cognitive perspective. arXiv preprint arXiv:2301.06627.
  71. Gpteval: A survey on assessments of chatgpt and gpt-4. arXiv preprint arXiv:2308.12488.
  72. Bernard Marr. 2022. What does chatgpt really mean for businesses?
  73. A survey on multi-hop question answering and generation. arXiv preprint arXiv:2204.09140.
  74. Learning reasoning strategies in end-to-end differentiable proving. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.
  75. Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer learning with synthetic corpora for spatial role labeling and reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6148–6165, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  76. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4582–4598, Online. Association for Computational Linguistics.
  77. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854.
  78. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
  79. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462, Online. Association for Computational Linguistics.
  80. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  81. Tomáš Nekvinda and Ondřej Dušek. 2021. Shades of BLEU, flavours of success: The case of MultiWOZ. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 34–46, Online. Association for Computational Linguistics.
  82. Putting chatgpt’s medical advice to the (turing) test. medRxiv, pages 2023–01.
  83. OpenAI. 2023. Gpt-4 technical report.
  84. Thoughtsource: A central hub for large language model reasoning data. arXiv preprint arXiv:2301.11596.
  85. Training language models to follow instructions with human feedback.
  86. Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  87. Modeling event plausibility with consistent conceptual abstraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1732–1743, Online. Association for Computational Linguistics.
  88. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  89. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597.
  90. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
  91. TIMEDIAL: Temporal commonsense reasoning in dialog. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7066–7076, Online. Association for Computational Linguistics.
  92. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  93. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  94. Scaling language models: Methods, analysis and insights from training gopher.
  95. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
  96. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR.
  97. Fabin Rasheed. 2020. Gpt3 sees.
  98. Partha Pratim Ray. 2023. Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems.
  99. Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv. Just Accepted.
  100. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.
  101. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.
  102. Stephen Shankland. 2023. Why the chatgpt ai chatbot is blowing everyone’s mind.
  103. Chatgpt and other large language models are double-edged swords.
  104. Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11321–11329.
  105. StepGame: A new benchmark for robust multi-hop spatial reasoning in texts. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11321–11329.
  106. Denis Shiryaev. 2022. Drawing mona lisa with chatgpt.
  107. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage.
  108. Clutrr: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4515.
  109. Noah Smith. 2023. Why does chatgpt constantly lie?
  110. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  111. Commonsense reasoning for natural language understanding: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172, pages 1–60.
  112. Read before generate! faithful long form question answering with machine reading. In Findings of the Association for Computational Linguistics: ACL 2022, pages 744–756.
  113. Improve query focused abstractive summarization by incorporating answer relevance. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3124–3131.
  114. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations.
  115. Teo Susnjak. 2022. Chatgpt: The end of online exam integrity? arXiv preprint arXiv:2212.09292.
  116. oLMpics - on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758.
  117. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
  118. No language left behind: Scaling human-centered machine translation.
  119. Richmond Thomason. 2018. Logic and artificial intelligence.
  120. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  121. H Holden Thorp. 2023. Chatgpt is fun, but not an author.
  122. Giuseppe Venuto. 2023. Giuven95/chatgpt-failures: Chatgpt failure archive.
  123. Douglas Walton. 2014. Abductive reasoning. University of Alabama Press.
  124. Ada Wan. 2022. Fairness in representation for multilingual NLP: Insights from controlled experiments on conditional language modeling. In International Conference on Learning Representations.
  125. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  126. Modeling semantic plausibility by injecting world knowledge. arXiv preprint arXiv:1804.00619.
  127. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  128. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  129. Peter Cathcart Wason and Philip Nicholas Johnson-Laird. 1972. Psychology of reasoning: Structure and content, volume 86. Harvard University Press.
  130. Emergent analogical reasoning in large language models.
  131. Emergent analogical reasoning in large language models. arXiv preprint arXiv:2212.09196.
  132. Emergent abilities of large language models. Transactions on Machine Learning Research.
  133. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  134. Towards ai-complete question answering: A set of prerequisite toy tasks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  135. Towards ai-complete question answering: A set of prerequisite toy tasks. In 4th International Conference on Learning Representations, ICLR 2016.
  136. Indonlu: Benchmark and resources for evaluating indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857.
  137. Nusax: Multilingual parallel sentiment dataset for 10 indonesian local languages.
  138. Cameron R. Wolfe. 2023. Specialized llms: Chatgpt, lamda, galactica, codex, sparrow, and more.
  139. Bloom: A 176b-parameter open-access multilingual language model.
  140. Retrieval-free knowledge-grounded dialogue response generation with adapters. In Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, pages 93–107.
  141. Diverse and faithful knowledge-grounded dialogue generation via sequential posterior inference. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38518–38534. PMLR.
  142. Ubar: Towards fully end-to-end task-oriented dialog system with gpt-2. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14230–14238.
  143. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
  144. Vision guided generative pre-trained language models for multimodal abstractive summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3995–4007, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  145. Adaptsum: Towards low-resource domain adaptation for abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5892–5904.
  146. Multiwoz 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. ACL 2020, page 109.
  147. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
  148. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  149. Description-driven task-oriented dialog modeling. ArXiv, abs/2201.08904.
  150. Knowledge-grounded dialogue generation with pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3377–3390.
  151. Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv preprint arXiv:2301.12867.
  152. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity.
Authors (13)
  1. Yejin Bang
  2. Samuel Cahyawijaya
  3. Nayeon Lee
  4. Wenliang Dai
  5. Dan Su
  6. Bryan Wilie
  7. Holy Lovenia
  8. Ziwei Ji
  9. Tiezheng Yu
  10. Willy Chung
  11. Quyet V. Do
  12. Yan Xu
  13. Pascale Fung
Citations (1,160)