
EnzChemRED, a rich enzyme chemistry relation extraction dataset (2404.14209v1)

Published 22 Apr 2024 in cs.CL

Abstract: Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
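
The F1 scores quoted in the abstract combine precision and recall into a single number. As a rough illustration only, the sketch below scores a set of predicted chemical conversion pairs against a gold set using exact matching. The pair representation, the placeholder identifiers, and the function name `precision_recall_f1` are assumptions made for this example; the abstract does not state the paper's exact matching or averaging scheme.

```python
from typing import Set, Tuple

# Hypothetical representation: a chemical conversion pair is a
# (substrate identifier, product identifier) tuple. The identifiers
# used below are placeholders, not real ChEBI entries.
Pair = Tuple[str, str]


def precision_recall_f1(gold: Set[Pair], predicted: Set[Pair]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("CHEBI:A", "CHEBI:B"), ("CHEBI:C", "CHEBI:D"), ("CHEBI:E", "CHEBI:F")}
predicted = {("CHEBI:A", "CHEBI:B"), ("CHEBI:C", "CHEBI:D"), ("CHEBI:X", "CHEBI:Y")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Two of the three predicted pairs match the gold set, and two of the three gold pairs are recovered, so precision, recall, and F1 all come out to two thirds in this toy case.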

Authors (19)
  1. Po-Ting Lai (14 papers)
  2. Elisabeth Coudert (1 paper)
  3. Lucila Aimo (1 paper)
  4. Kristian Axelsen (1 paper)
  5. Lionel Breuza (1 paper)
  6. Edouard de Castro (1 paper)
  7. Marc Feuermann (1 paper)
  8. Anne Morgat (2 papers)
  9. Lucille Pourcel (1 paper)
  10. Ivo Pedruzzi (1 paper)
  11. Sylvain Poux (1 paper)
  12. Nicole Redaschi (3 papers)
  13. Catherine Rivoire (1 paper)
  14. Anastasia Sveshnikova (1 paper)
  15. Chih-Hsuan Wei (16 papers)
  16. Robert Leaman (15 papers)
  17. Ling Luo (32 papers)
  18. Zhiyong Lu (113 papers)
  19. Alan Bridge (2 papers)