A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing (2209.13136v1)

Published 27 Sep 2022 in cs.CL, cond-mat.mtrl-sci, and cond-mat.soft

Abstract: The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, on 2.4 million materials science abstracts; when used as the text encoder, it outperforms other baseline models on three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org, which can be used to conveniently locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.

Authors (8)
  1. Pranav Shetty (8 papers)
  2. Arunkumar Chitteth Rajan (2 papers)
  3. Christopher Kuenneth (7 papers)
  4. Sonkakshi Gupta (1 paper)
  5. Lakshmi Prerana Panchumarti (1 paper)
  6. Lauren Holm (1 paper)
  7. Chao Zhang (907 papers)
  8. Rampi Ramprasad (43 papers)
Citations (47)