Information Extraction from Historical Well Records Using A Large Language Model (2405.05438v1)
Abstract: To reduce the environmental risks and impacts of orphaned wells (abandoned oil and gas wells), it is essential to first locate and then plug these wells. Historical documents for many of these wells exist, but they are often unstructured, uncleaned, and outdated, and they vary widely by state and document type. Given the large number of wells, manually reading and digitizing the information in these documents is not feasible. Here, we propose a new computational approach for rapidly and cost-effectively locating these wells. Specifically, we leverage the advanced capabilities of large language models (LLMs) to extract vital information, including well location and depth, from historical records of orphaned wells. In this paper, we present an information extraction workflow based on open-source Llama 2 models and test it on a dataset of 160 well documents. The workflow achieves 100% accuracy in extracting location and depth from clean, PDF-based reports, but accuracy drops to 70% on unstructured, image-based well records. Compared with manual digitization, the workflow provides significant benefits, including reduced labor and increased automation. In general, more detailed prompting leads to better information extraction, and LLMs with more parameters typically perform better. We provide a detailed discussion of the current challenges and the corresponding opportunities and approaches to address them. More broadly, a vast amount of geoscientific information is locked up in old documents, and this work demonstrates that recent breakthroughs in LLMs enable us to unlock it.
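To make the described workflow concrete, below is a minimal sketch of its core step: prompting an open-source Llama 2 chat model to pull a well's location and depth out of a record's extracted text. This is an illustrative assumption of the setup, not the authors' exact implementation; the model ID, prompt wording, and JSON schema are all placeholders (the paper tests several Llama 2 variants, and image-based records would first need OCR rather than plain text extraction).

```python
# Hedged sketch of the paper's extraction step: ask a Llama 2 chat model to
# return a well's location and total depth as JSON. Model choice, prompt
# text, and output schema are assumptions for illustration only.
import json
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed; the paper tests several sizes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")


def extract_well_info(record_text: str) -> dict:
    """Prompt the model for location and depth, then parse its JSON reply."""
    # Llama 2 chat models expect the [INST] ... [/INST] instruction format.
    prompt = (
        "[INST] You are extracting data from a historical oil and gas well "
        "record. From the text below, report the well's location (latitude "
        "and longitude if given, otherwise the legal description) and its "
        "total depth, as JSON with keys 'location' and 'depth'. Use null "
        "for missing values.\n\n"
        f"{record_text} [/INST]"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    reply = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Pull the first JSON object out of the (possibly chatty) reply.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    return json.loads(match.group(0)) if match else {"raw_reply": reply}


if __name__ == "__main__":
    sample = (
        "Well No. 7, located 1320 ft FNL and 990 ft FWL, Sec. 12, T5N R3W, "
        "drilled to a total depth of 2,450 feet."
    )
    print(extract_well_info(sample))
```

For clean PDF reports, the `record_text` input could come from a converter such as pdftotext; the abstract's accuracy gap (100% vs. 70%) suggests that OCR quality on image-based records, not the prompting itself, is the main bottleneck.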