Evaluating the Literature Review Writing Capabilities of LLMs
The paper "Are LLMs Good Literature Review Writers? Evaluating the Literature Review Writing Ability of LLMs" critically examines the performance of LLMs in automating the literature review process, a task that predominantly involves the collection, organization, and summarization of extensive academic literature. Given the significant advancements of LLMs as tools in natural language processing, the authors aim to establish a framework that benchmarks LLMs' effectiveness in three core tasks: reference generation, abstract writing, and literature review composition.
Framework and Methodology
The authors propose a structured framework that evaluates LLM competence along multiple dimensions, including the rate of hallucinated references, semantic coverage, and factual consistency relative to human-produced content. The framework consists of three primary tasks and leverages human-authored reviews as the gold standard for comparison.
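As a concrete illustration of the first dimension, a hallucination rate can be defined as the fraction of generated references that cannot be matched to any real paper. The sketch below is only an assumption for illustration: the helper names, the fuzzy-matching approach, and the 0.9 threshold are not taken from the paper.

```python
from difflib import SequenceMatcher

def titles_match(generated: str, real: str, threshold: float = 0.9) -> bool:
    """Fuzzy title comparison; the 0.9 threshold is an illustrative assumption."""
    ratio = SequenceMatcher(None, generated.lower(), real.lower()).ratio()
    return ratio >= threshold

def hallucination_rate(generated_titles: list[str], real_titles: list[str]) -> float:
    """Fraction of generated references with no close match in the real corpus."""
    if not generated_titles:
        return 0.0
    unmatched = sum(
        1 for g in generated_titles
        if not any(titles_match(g, r) for r in real_titles)
    )
    return unmatched / len(generated_titles)

# One of the two generated titles is fabricated, so the rate is 0.5.
print(hallucination_rate(
    ["Attention Is All You Need", "A Fabricated Survey That Does Not Exist"],
    ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"],
))
```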
- Task Design: The research establishes distinct tasks for evaluating reference generation, abstract writing, and literature review writing, designed to mirror the critical components of the academic literature review process. Each task prompts LLMs to generate text from different inputs, such as article titles, keywords, or abstracts, ensuring a comprehensive assessment of the models' generative capabilities (a hypothetical prompt sketch follows this list).
- Dataset and Evaluation Metrics: A dataset compiled from 51 journals spanning Biology, Chemistry, Mathematics, Physics, Social Sciences, and Technology forms the core of the evaluation. Key metrics include reference accuracy and title search rate, alongside semantic similarity and factual consistency measures, providing a well-rounded assessment of LLM outputs (see the similarity sketch after this list).
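The following minimal sketch shows how task-specific prompts of the kind described above might be assembled. The templates and field names are assumptions; the paper's actual prompts are not reproduced here.

```python
# Hypothetical prompt templates for the three tasks; wording is illustrative only.
def build_prompt(task: str, **fields: str) -> str:
    templates = {
        "reference_generation": (
            "Suggest scholarly references relevant to a review titled "
            "'{title}' on the topic of {keywords}. List full citations."
        ),
        "abstract_writing": (
            "Write an abstract for a literature review titled '{title}' "
            "covering the following keywords: {keywords}."
        ),
        "literature_review_writing": (
            "Write a literature review section based on these abstracts:\n{abstracts}"
        ),
    }
    return templates[task].format(**fields)

print(build_prompt("abstract_writing",
                   title="LLMs for Scientific Writing",
                   keywords="hallucination, citation accuracy"))
```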
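For the text-quality metrics, a rough stand-in for semantic similarity is cosine similarity between TF-IDF vectors of the generated and human-written texts. The paper may well use embedding-based or other measures; the function below is only an illustrative assumption and relies on scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity in [0, 1] between two texts in a shared TF-IDF space."""
    vectors = TfidfVectorizer().fit_transform([generated, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

llm_abstract = "This review surveys methods for detecting hallucinated citations in LLM output."
human_abstract = "We survey approaches to detecting fabricated citations produced by large language models."
print(round(semantic_similarity(llm_abstract, human_abstract), 3))
```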
Results and Analysis
The evaluation covers four LLMs: Claude-3.5-Sonnet, GPT-4o, Qwen-2.5-72B-Instruct, and Llama-3.2-3B-Instruct. Across these models, the research identifies persistent hallucination in the form of fabricated or inaccurate references, despite advances in LLM technology. Notably, Claude-3.5-Sonnet consistently displays superior performance across tasks, particularly in generating accurate references, which is likely influenced by differences in training data exposure.
The paper also highlights domain-specific variations in performance. For reference generation, models generally perform better in Mathematics and Social Sciences but show lower accuracy in fields like Chemistry and Technology. Abstract writing follows a different pattern: factual consistency is notably lower in the Social Sciences.
Implications and Future Directions
The findings offer several implications for future developments in AI and academic writing support tools. The persistent hallucination problem points to a critical need for reliable citation verification mechanisms; a sketch of one such check appears below. Training on more diverse, accurate, and up-to-date data could mitigate these issues, and domain-specific fine-tuning could further improve LLM proficiency across academic fields.
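One plausible form of such a verification step is to check each generated title against a bibliographic database before accepting the citation. The sketch below uses the Crossref REST API as an assumed backend; the paper does not prescribe any particular service, and the exact-match comparison is deliberately simplistic.

```python
import requests

def title_found_in_crossref(title: str, timeout: float = 10.0) -> bool:
    """Return True if Crossref's top result matches the generated title."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=timeout,
    )
    resp.raise_for_status()
    items = resp.json().get("message", {}).get("items", [])
    if not items:
        return False
    candidate = " ".join(items[0].get("title", [])).lower()
    return candidate == title.lower()  # exact match; a fuzzy check would be more forgiving

if __name__ == "__main__":
    print(title_found_in_crossref("Attention Is All You Need"))
```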
In practice, while LLMs can reduce the manual workload involved in literature reviews, the paper cautions against deploying these models as sole authors of academic reviews without human oversight. Enhanced reliability and reference accuracy are paramount for broader acceptance in academic circles.
In conclusion, this work provides a foundation for future AI research in academic writing and emphasizes the necessity for continual improvement of LLM capabilities, ensuring that they evolve from capable assistants into reliable autonomous agents in scientific inquiry. The paper's methodological rigor and comprehensive framework offer a valuable resource for evaluating and benchmarking LLMs, guiding both future research and practical applications in the academic community.