MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension (2403.08192v1)

Published 13 Mar 2024 in cs.CL and q-bio.BM

Abstract: LLMs are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To address this absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset comprising 62K QA pairs over 23K molecules. Each QA pair, composed of a manually written question, a positive option, and three negative options, is semantically consistent with a molecular description drawn from an authoritative molecular corpus. MoleculeQA is not only the first benchmark for evaluating factual bias in molecular comprehension but also the largest QA dataset for molecular research. A comprehensive evaluation of existing molecular LLMs on MoleculeQA exposes their deficiencies in specific areas and pinpoints several factors that are particularly crucial for molecular understanding.
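
The abstract describes each record's structure: one question about a molecule, one positive option, and three negative options. As a purely illustrative sketch, a single MoleculeQA-style record might be represented as follows; the field names, the example molecule, and the option text are assumptions made for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of a MoleculeQA-style record. All field names and
# values are illustrative assumptions, not the released dataset's schema.
qa_pair = {
    "molecule_id": "CID-2244",              # assumed identifier (e.g., a PubChem CID)
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",   # aspirin, chosen only as a familiar example
    "question": "What is the primary therapeutic use of this molecule?",
    "options": {
        "A": "It is used as an analgesic and anti-inflammatory agent.",  # positive option
        "B": "It is used as a general anesthetic.",                      # negative option
        "C": "It is used as an antibiotic.",                             # negative option
        "D": "It is used as an antiviral agent.",                        # negative option
    },
    "answer": "A",
}

def is_correct(predicted_option: str, record: dict) -> bool:
    """Standard multiple-choice scoring: the model is credited only if it
    selects the single positive option among the four choices."""
    return predicted_option == record["answer"]

print(is_correct("A", qa_pair))  # True
```

Under this framing, benchmark performance reduces to the fraction of records on which a model selects the positive option, which is what allows MoleculeQA to measure factual accuracy directly rather than through text-overlap metrics.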
