Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts (2308.10410v4)

Published 21 Aug 2023 in cs.CL

Abstract: Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert inputs and are therefore expensive to create and update. Recently, LLMs have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMA2, by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses such as missing details or factual errors. Finally, we compared the rating behavior between humans and GPT-4 and found systematic bias in GPT-based evaluation.

Authors (10)
  1. Fan Gao (40 papers)
  2. Hang Jiang (20 papers)
  3. Moritz Blum (6 papers)
  4. Jinghui Lu (28 papers)
  5. Dairui Liu (9 papers)
  6. Yuang Jiang (12 papers)
  7. Irene Li (47 papers)
  8. Rui Yang (221 papers)
  9. Qingcheng Zeng (30 papers)
  10. Tianwei She (6 papers)
Citations (2)