ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision
Abstract: Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and in advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive because of the significant labor it requires from domain experts. Consequently, the scarcity of training data hinders the progress of related models in this domain. In this paper, we propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method uses frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that ReactIE achieves substantial improvements and outperforms all existing baselines.