Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation (2502.05151v2)

Published 7 Feb 2025 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: With the advent of large multimodal LLMs, science is now at a threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview over these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of "AI4Science".


The paper is an extensive technical survey of how large language models (LLMs) and multimodal foundation models are being integrated into the scientific research process. It presents a structured review across five key aspects of the research cycle, describing state-of-the-art methods, associated datasets, empirical results, and ethical concerns.

Overview and Scope

The survey systematically categorizes AI-assisted scientific workflows into five central areas:

Literature search, summarization, and comparison
Hypothesis generation, idea formation, and automated experimentation
Text-based content generation (including title, abstract, related work, citation generation, proofreading, and paraphrasing)
Multimodal content generation and understanding (covering figure, table, slide, and poster generation and reasoning)
AI-assisted peer review processes

The paper emphasizes how recent advances in NLP and computer vision have enabled automated, context-aware systems that not only retrieve and synthesize scientific literature but also support the generation of novel hypotheses and scientific artifacts.

Literature Search, Summarization, and Comparison

The review delineates a detailed taxonomy of literature retrieval systems. It contrasts traditional search engines based on keyword matching with AI-enhanced systems leveraging semantic search through techniques such as retrieval-augmented generation (RAG). In particular, the survey:

Categorizes search systems into six types: classic search engines, AI-enhanced search systems, graph-based systems, interactive paper chat platforms, recommender systems, and benchmarking platforms.
Discusses the construction and use of large-scale databases and publisher repositories (open access, subscription-based, hybrid, institutional, governmental, and grey literature repositories) to underpin these systems.
Reviews quantitative evaluation metrics and qualitative analyses, highlighting that while citation analysis and structured summaries are common, these systems face limitations such as insufficient personalization, incomplete context capture, and issues arising from biases in training data.
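
The contrast between keyword matching and similarity-based retrieval feeding a RAG prompt can be illustrated with a minimal sketch. It is not taken from the survey: the function names (`keyword_match`, `rank_by_similarity`, `build_prompt`) are hypothetical, and bag-of-words cosine similarity stands in for the learned dense embeddings a real semantic search system would use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def keyword_match(query, docs):
    """Classic search: keep only documents that contain every query term."""
    terms = set(tokenize(query))
    return [d for d in docs if terms <= set(tokenize(d))]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query, docs, k=2):
    """Similarity search: rank all documents, return the top k.

    Assumption: term counts stand in for an embedding model here.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(query, docs, k=2):
    """Retrieval-augmented prompt: retrieved passages precede the question."""
    context = "\n".join(f"- {d}" for d in rank_by_similarity(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

With partial term overlap, strict keyword matching returns nothing, while the ranked retrieval still surfaces the closest document for the generator to condition on.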

Hypothesis Generation, Idea Formation, and Automated Experimentation

The survey reviews various methods where LLMs are employed for generating research hypotheses and ideas. It addresses challenges such as handling long context inputs, mitigating hallucinations, and iterative refinement strategies. Key contributions include:

Pipelines that first retrieve related works to ground hypothesis generation, enhancing alignment with scientifically validated discoveries.
Strategies using few-shot learning, fine-tuning, reinforcement learning, and multi-agent frameworks to iteratively refine generated hypotheses and improve testability.
Discussion of several datasets—ranging from curated paper abstracts (e.g., from ACL Anthology) to domain-specific collections in Chemistry, Social Science, and Medicine—used to benchmark idea and hypothesis generation.
An extensive review of automated experimentation methods that integrate hyperparameter tuning, multi-agent systems, and tree search algorithms to design, execute, and evaluate experimental protocols in domains such as AutoML and drug discovery.
A rigorous presentation of evaluation practices comparing generated hypotheses against gold standards using metrics like BLEU, BERTScore, and specialized LLM-based evaluation accompanied by human expert assessments.
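
To make the overlap-based comparison against gold standards concrete, the following sketch computes clipped n-gram precision, the core component of BLEU. It is an illustration only: full BLEU also combines several n-gram orders and applies a brevity penalty, and the function names are hypothetical rather than drawn from the survey.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping"), so repetition is not rewarded.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

Surface-overlap scores of this kind penalize valid paraphrases, which is one reason the evaluation practices described above pair them with embedding-based metrics, LLM-based judging, and human expert assessment.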

Text-Based Content Generation

The paper addresses the generation of key textual elements in scientific papers. This encompasses title and abstract generation, related work synthesis, citation generation, and even press release drafting. The key technical insights include:

Transformer-based architectures (e.g., BART, GPT-2, T5) that generate summaries and textual content, with detailed discussion of approaches that produce humorous versus conventional titles.
Detailed evaluations demonstrating that while LLMs are capable of generating coherent long texts, they still lag in preserving factual consistency and correctly synthesizing bibliographic details.
Comparative results showing that human evaluators rate readability and style higher after LLM-based proofreading and paraphrasing, yet issues such as citation hallucination remain significant, with a gap that is especially evident when comparing the proprietary models GPT-3.5 and GPT-4.
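
One simple safeguard against citation hallucination is to check generated references against a trusted bibliography. The sketch below is an assumption-laden illustration, not a method from the survey: `normalize` and `flag_unverified` are hypothetical helpers, matching is by normalized title only, and the titles in the usage example are placeholders.

```python
import re


def normalize(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


def flag_unverified(generated_titles, known_titles):
    """Return generated titles with no match in the trusted bibliography.

    Assumption: exact match on the normalized title is sufficient. A real
    checker would also compare authors, year, and venue, and would
    tolerate small spelling differences.
    """
    known = {normalize(t) for t in known_titles}
    return [t for t in generated_titles if normalize(t) not in known]


# Usage with placeholder titles:
bibliography = ["A Survey of Placeholder Methods"]
generated = ["a survey of placeholder methods.", "A Paper That Does Not Exist"]
unverified = flag_unverified(generated, bibliography)
```

Flagged entries are candidates for human review rather than proof of fabrication, since a trusted bibliography is rarely complete.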

Multimodal Content Generation and Understanding

This section surveys the state of the art in generating and interpreting non-textual scientific artifacts such as figures, tables, slides, and posters. The technical discussion centers on:

Datasets that capture multimodal pairs—including text captions aligned with TikZ code for figures, triplets of figures and code (e.g., for chart generation), and paired datasets for slides and posters.
Approaches that treat figure generation as a code-generation problem (e.g., generating TikZ or Python scripts) where techniques like Monte Carlo Tree Search (MCTS) are used to enforce structural constraints in outputs.
Benchmarks for figure understanding that are framed as visual question-answering tasks, where systems must reason over spatial, numerical, and attribute dimensions. Automatic metrics (e.g., DreamSim, CLIPScore) and LLM-based evaluations are discussed, noting that open-source models still trail proprietary systems in reaching human-comparable performance.
Discussion of table understanding methods involving serialization of tables into linearized sequences versus structure-aware techniques that incorporate graph-based constraints, along with emerging approaches for text-to-table generation that support structured data extraction from lengthy scientific texts.
Slide and poster generation methodologies that range from rule-based extraction methods to sequence-to-sequence models, combining dense retrieval strategies with extraction and summarization to generate presentation content that mirrors a paper’s structure and intent.
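
The serialization approach to table understanding mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the `[CAPTION]`, `[HEADER]`, and `[ROW]` markers and the function name `linearize_table` are hypothetical choices, not a format prescribed by the survey.

```python
def linearize_table(header, rows, caption=""):
    """Flatten a table into one text sequence a language model can read.

    Each cell is paired with its column name so the header-cell
    relation survives the loss of two-dimensional layout.
    """
    parts = []
    if caption:
        parts.append(f"[CAPTION] {caption}")
    parts.append("[HEADER] " + " | ".join(header))
    for row in rows:
        cells = [f"{name}: {value}" for name, value in zip(header, row)]
        parts.append("[ROW] " + " | ".join(cells))
    return " ".join(parts)
```

Linearization of this kind is easy to feed to any text model but discards row and column adjacency, which is the gap the structure-aware, graph-based techniques aim to close.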

Ethical Considerations and Limitations

The survey devotes considerable discussion to the ethical ramifications of deploying AI in scientific workflows. It emphasizes that:

The risk of reinforcing biases and the possibility of AI hallucinating or fabricating information, particularly in citing references while generating text, necessitate robust human oversight and improved transparency in model outputs.
The use of LLMs can inadvertently promote a convergence toward homogenized scientific writing, potentially stifling innovation in underrepresented areas.
In the domain of multimodal content, technical limitations—such as misalignment between generated figures or tables and the underlying scientific context—pose risks of misinformation or decreased clarity.
Ethical concerns extend into peer review processes where automated systems must balance efficiency with fairness and accountability.

A recurring theme across all sections is the importance of integrating human experts in the loop to verify and refine AI-generated content, ensuring reproducibility, factual accuracy, and adherence to disciplinary standards.

Concluding Remarks

Overall, the paper serves as a comprehensive reference that not only categorizes and evaluates existing AI-based methods across the research cycle but also identifies gaps—both technical and ethical—that warrant further investigation. Its technical focus, complete with detailed empirical observations and critical numerical results (e.g., improvements in novelty scores, rates of citation fabrication between models, and performance discrepancies between proprietary and open-source systems), makes it a valuable resource for researchers looking to systematically integrate AI into scientific discovery processes.


Authors (14)

Steffen Eger (90 papers)
Yong Cao (33 papers)
Jennifer D'Souza (49 papers)
Andreas Geiger (136 papers)
Christian Greisinger (2 papers)
Stephanie Gross (3 papers)
Yufang Hou (49 papers)
Brigitte Krenn (4 papers)
Anne Lauscher (58 papers)
Yizhi Li (43 papers)
Chenghua Lin (127 papers)
Nafise Sadat Moosavi (38 papers)
Wei Zhao (309 papers)
Tristan Miller (7 papers)
