
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations (2407.04069v2)

Published 4 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

Authors (13)
  1. Md Tahmid Rahman Laskar
  2. Sawsan Alqahtani
  3. M Saiful Bari
  4. Mizanur Rahman
  5. Mohammad Abdullah Matin Khan
  6. Haidar Khan
  7. Israt Jahan
  8. Amran Bhuiyan
  9. Chee Wei Tan
  10. Md Rizwan Parvez
  11. Enamul Hoque
  12. Shafiq Joty
  13. Jimmy Huang
Citations (4)
