
HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science (2310.08511v1)

Published 12 Oct 2023 in cs.CL, cond-mat.mtrl-sci, and cs.AI

Abstract: We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we then apply to finetune a LLaMA-based LLM targeted at materials science (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science textual data available in the open literature, and HoneyBee is the first billion-parameter LLM specialized to materials science. In MatSci-Instruct we improve the trustworthiness of the generated data by prompting multiple commercially available LLMs, using an Instructor module (e.g., ChatGPT) for generation and an independent Verifier module (e.g., Claude) for verification. Using MatSci-Instruct, we construct a dataset spanning multiple tasks and measure its quality along several dimensions, including accuracy against known facts, relevance to materials science, and completeness and reasonableness of the data. Moreover, we iteratively generate more targeted instructions and instruction-data in a finetuning-evaluation-feedback loop, leading to progressively better performance for our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark shows that HoneyBee outperforms existing LLMs on materials science tasks and improves iteratively across successive stages of instruction-data refinement. We study the quality of HoneyBee's language modeling through automatic evaluation and analyze case studies to further understand the model's capabilities and limitations. Our code and relevant datasets are publicly available at https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee.
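The abstract describes an Instructor-Verifier curation loop feeding a finetuning-evaluation-feedback cycle. The sketch below is a minimal illustration of how one generation-and-verification round of such a loop could be wired together; the `instructor`/`verifier` callables, the quality dimensions as dictionary keys, and the 0.7 acceptance threshold are assumptions for illustration, not the paper's exact prompts or criteria.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: an "instructor" LLM drafts candidate instruction-data
# and an independent "verifier" LLM scores it along the trust dimensions named
# in the abstract (accuracy, relevance, completeness, reasonableness).

@dataclass
class InstructionExample:
    instruction: str
    response: str
    scores: dict

def matsci_instruct_round(
    instructor: Callable[[str], List[dict]],   # e.g., a wrapper around ChatGPT
    verifier: Callable[[str, str], dict],      # e.g., a wrapper around Claude
    seed_topics: List[str],
    min_score: float = 0.7,                    # illustrative threshold
) -> List[InstructionExample]:
    """One generation-verification round of a MatSci-Instruct-style loop."""
    accepted: List[InstructionExample] = []
    for topic in seed_topics:
        # 1) Instructor module drafts instruction/response pairs for the topic.
        candidates = instructor(
            f"Generate materials-science instruction data about: {topic}"
        )
        for cand in candidates:
            # 2) Verifier module rates each pair on the trust dimensions.
            scores = verifier(cand["instruction"], cand["response"])
            # 3) Keep only pairs whose weakest dimension clears the threshold.
            if min(scores.values()) >= min_score:
                accepted.append(
                    InstructionExample(cand["instruction"], cand["response"], scores)
                )
    return accepted

if __name__ == "__main__":
    # Stub instructor/verifier so the sketch runs without any API access.
    demo_instructor = lambda prompt: [
        {"instruction": "Define yield strength.",
         "response": "The stress at which a material begins to deform plastically."}
    ]
    demo_verifier = lambda instr, resp: {
        "accuracy": 0.9, "relevance": 0.95, "completeness": 0.8, "reasonableness": 0.9
    }
    data = matsci_instruct_round(demo_instructor, demo_verifier, ["mechanical properties"])
    print(f"Accepted {len(data)} verified examples for finetuning.")
```

In the paper's progressive setup, the accepted data from each round is used to finetune HoneyBee, and evaluation feedback on the finetuned model then steers the next round of instruction generation.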
