
MARG: Multi-Agent Review Generation for Scientific Papers (2401.04259v1)

Published 8 Jan 2024 in cs.CL

Abstract: We study the ability of LLMs to generate feedback for scientific papers and develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion. By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM, and by specializing agents and incorporating sub-tasks tailored to different comment types (experiments, clarity, impact) it improves the helpfulness and specificity of feedback. In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time, and only 1.7 comments per paper were rated as good overall in the best baseline. Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).

Overview

The multi-agent review generation method, MARG-S, addresses a key limitation of LLMs such as GPT-4: the bounded length of their input context. The approach delegates the task of generating peer-review feedback on scientific papers across multiple instances of an LLM. By distributing the text among several "agents," each handling a fragment and communicating with the others, MARG-S can process papers whose full text exceeds the base model's input limit. It further improves the specificity and helpfulness of feedback by specializing agents to focus on particular aspects of critique, such as experiments, clarity, and impact.
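The distribution step described above can be illustrated with a minimal sketch. The function name, the greedy packing strategy, and the word-count budget are all assumptions for illustration; a real system would measure length with the model's tokenizer rather than whitespace word counts.

```python
# Hypothetical sketch: split a paper's full text into per-agent chunks so
# that each worker agent's prompt stays within the base LLM's input limit.
# Word count is a crude stand-in for a proper token count.

def chunk_paper(paragraphs, max_words_per_agent=3000):
    """Greedily pack paragraphs into chunks, one chunk per worker agent."""
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words_per_agent:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs (rather than cutting mid-sentence) keeps each agent's fragment coherent, at the cost of slightly uneven chunk sizes.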

System Design

MARG-S's architecture consists of a designated leader agent that orchestrates the process, multiple worker agents, each holding a section of the scientific paper, and specialized expert agents that focus on different review aspects. Coordination relies on a communication protocol that lets agents exchange messages and gather insights spanning the paper's entirety. The method also includes a crucial refinement stage in which initial feedback is polished to improve clarity and ensure comments are contextually relevant before they are presented to the user.
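The leader/worker/expert pattern described above can be sketched as follows. This is not the authors' code: the agent class, role prompts, message formats, and the `query_llm` stub are all hypothetical, standing in for real LLM API calls and the paper's actual communication protocol.

```python
# Illustrative sketch (assumptions, not the MARG-S implementation) of a
# leader agent coordinating per-chunk worker agents and aspect-specific
# expert agents, followed by a refinement pass.

def query_llm(system_prompt, message):
    # Placeholder: a real system would call the underlying LLM here.
    # This deterministic stub just lets the control flow run end to end.
    return f"({system_prompt[:24]}...) reply to: {message[:24]}..."

class Agent:
    """One LLM instance with a role prompt and a running message history."""
    def __init__(self, role_prompt):
        self.role_prompt = role_prompt
        self.history = []

    def receive(self, message):
        self.history.append(message)
        reply = query_llm(self.role_prompt, "\n".join(self.history))
        self.history.append(reply)
        return reply

def run_review(paper_chunks, aspects=("experiments", "clarity", "impact")):
    workers = [Agent(f"You hold this paper excerpt:\n{c}") for c in paper_chunks]
    experts = {a: Agent(f"You critique the paper's {a}.") for a in aspects}
    leader = Agent("You coordinate agents to write a peer review.")

    comments = []
    for aspect, expert in experts.items():
        # The leader broadcasts the expert's question to every worker and
        # relays their excerpt-grounded answers back to the expert.
        question = f"What should a reviewer note about the paper's {aspect}?"
        answers = [w.receive(question) for w in workers]
        comments.append(expert.receive("\n".join(answers)))

    # Refinement stage: polish the draft comments before showing the user.
    return leader.receive("Refine these comments:\n" + "\n".join(comments))
```

Routing every exchange through a question-and-answer round lets each comment draw on evidence from all chunks, which is how the system surfaces feedback that spans the whole paper despite per-agent context limits.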

User Study Evaluation

In a user study, the multi-agent approach showed a marked improvement in the quality of generated comments compared with the baseline methods. Participants reported that MARG-S offered specific, accurate, and actionable suggestions. However, while MARG-S surpassed the other methods in producing "good" comments, there remains room for broad improvement: across all methods, including MARG-S, a notable proportion of comments were rated "bad" or "highly inaccurate."

Potential and Challenges

The introduction of MARG-S into the domain of scientific review generation is a promising step forward. It showcases an advanced application of LLMs and offers a potential model for future AI-driven peer-review systems. The increased cost of running such multi-agent systems, however, is a significant consideration for practical deployment. Future iterations of MARG-S would benefit from optimization for cost and efficiency, the incorporation of related literature for more informed reviews, and improvements in managing agent communication so that even larger inputs can be handled without overwhelming the system's capacity. With further refinement, systems like MARG-S could substantially aid scientific communities in the review process, offering more comprehensive, insightful feedback to authors and potentially reshaping the peer-review landscape.

Authors (4)
  1. Mike D'Arcy
  2. Tom Hope
  3. Larry Birnbaum
  4. Doug Downey
Citations (13)