The Battle of LLMs: A Comparative Study in Conversational QA Tasks (2405.18344v1)

Published 28 May 2024 in cs.CL and cs.AI

Abstract: LLMs have gained considerable interest for their impressive performance on various tasks. Within this domain, ChatGPT and GPT-4, developed by OpenAI, and Gemini, developed by Google, have emerged as particularly popular among early adopters. Additionally, Mixtral by Mistral AI and Claude by Anthropic have been newly released, further expanding the landscape of advanced LLMs. These models are viewed as disruptive technologies with applications spanning customer service, education, healthcare, and finance. More recently, Mistral has entered the scene, captivating users with its unique ability to generate creative content. Understanding the perspectives of these users is crucial, as they can offer valuable insights into the potential strengths, weaknesses, and overall success or failure of these technologies in various domains. This research delves into the responses generated by ChatGPT, GPT-4, Gemini, Mixtral, and Claude across different conversational QA corpora. Evaluation scores were meticulously computed and subsequently compared to ascertain the overall performance of these models. Our study pinpointed instances where these models provided inaccurate answers to questions, offering insights into potential areas where they might be susceptible to errors. In essence, this research provides a comprehensive comparison and evaluation of these state-of-the-art LLMs, shedding light on their capabilities while also highlighting potential areas for improvement.
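The abstract describes computing evaluation scores over model responses on conversational QA corpora. Below is a minimal sketch of what such scoring could look like, assuming token-overlap F1 and Jaccard similarity between a model's answer and the gold answer; the metric choices, model names, and example strings are illustrative assumptions, not the authors' actual pipeline (the paper also considers metrics such as BLEU, ROUGE, and METEOR).

```python
# Minimal sketch (not the authors' exact pipeline): score hypothetical model
# answers against a gold answer using token-overlap F1 and Jaccard similarity.
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Very simple whitespace tokenization; real evaluations typically normalize
    # punctuation and casing more carefully.
    return text.lower().split()

def f1_overlap(prediction: str, reference: str) -> float:
    # Token-level F1, as commonly used for extractive/conversational QA.
    pred, ref = tokenize(prediction), tokenize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def jaccard(prediction: str, reference: str) -> float:
    # Jaccard similarity over the sets of answer tokens.
    pred, ref = set(tokenize(prediction)), set(tokenize(reference))
    return len(pred & ref) / len(pred | ref) if pred | ref else 0.0

# Hypothetical answers to one question; in the study these would come from
# ChatGPT, GPT-4, Gemini, Mixtral, and Claude on a corpus such as CoQA.
gold = "The cat sat on the mat."
answers = {
    "model_a": "A cat was sitting on the mat.",
    "model_b": "The cat sat on the mat.",
}
for name, answer in answers.items():
    print(f"{name}: F1={f1_overlap(answer, gold):.2f}, "
          f"Jaccard={jaccard(answer, gold):.2f}")
```

In a study like this one, such per-answer scores would be averaged over each corpus and compared across models to rank their overall conversational QA performance.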

Authors (2)
  1. Aryan Rangapur
  2. Aman Rangapur