Data-driven Discovery with Large Generative Models (2402.13610v1)
Abstract: With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery -- a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata -- a feat previously unattainable -- while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.
Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark