
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data (2402.17644v2)

Published 27 Feb 2024 in cs.CL and cs.AI

Abstract: Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate LLMs' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which has much room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle in using causal knowledge and provided data simultaneously. Code and data are in https://github.com/xxxiaol/QRData.


Summary

  • The paper introduces the QRData benchmark to systematically assess LLMs' statistical and causal reasoning on real-world datasets.
  • It reveals that GPT-4 achieved 58% accuracy, exposing significant gaps in current LLMs' ability to perform data-based causal analysis.
  • The findings call for enhanced training strategies and model architectures to better integrate data analysis with causal inference capabilities.

Analyzing LLMs in Statistical and Causal Reasoning: Insights from QRData Benchmark

This paper addresses the critical question of whether LLMs possess advanced capabilities in data-driven statistical and causal reasoning. While LLMs have demonstrated abilities in basic data manipulation tasks like summarization and visualization, their proficiency in handling more complex quantitative reasoning tasks remains insufficiently explored. This research introduces a new benchmark, Quantitative Reasoning with Data (QRData), which is specifically designed to systematically assess the ability of LLMs to apply statistical and causal reasoning to real-world datasets.

QRData Benchmark

QRData is a carefully curated benchmark of 411 data-driven questions spanning statistical and causal reasoning. Each question is accompanied by data sheets drawn from textbooks, online learning materials, and academic papers. The benchmark is complemented by an auxiliary set of 290 text-only questions, QRText, which enables a comparison of reasoning capabilities with and without data access. QRData evaluates data-based quantitative reasoning under several approaches: natural language reasoning (Chain-of-Thought, CoT), program-based reasoning (Program-of-Thoughts, PoT, and code interpreter assistants), and agent reasoning (ReAct).
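
To make the program-based setting concrete, below is a minimal sketch of a Program-of-Thoughts-style pipeline over a QRData-like question. It is not the authors' implementation: the prompt wording, the query_llm stub, and the bare exec call are placeholders for whatever model API and sandbox an actual evaluation would use.

```python
# Minimal PoT-style sketch (illustrative, not the paper's code): the model is
# asked to write Python that analyzes the data sheet, the generated program is
# executed, and its printed output is taken as the answer.
import contextlib
import io

import pandas as pd


def query_llm(prompt: str) -> str:
    """Hypothetical stub: return Python code generated by an LLM."""
    raise NotImplementedError


def answer_with_pot(csv_path: str, question: str) -> str:
    preview = pd.read_csv(csv_path).head().to_string()
    prompt = (
        f"Data file: {csv_path}\n"
        f"First rows:\n{preview}\n\n"
        f"Question: {question}\n"
        "Write Python code that loads the CSV with pandas, computes the answer, "
        "and prints only the final answer."
    )
    code = query_llm(prompt)
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        exec(code, {"pd": pd})  # sandboxing and error handling omitted for brevity
    return out.getvalue().strip()
```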

Key Findings

The paper evaluates a range of LLMs, including GPT-4 and open-source models such as Deepseek-coder-instruct. The best-performing model, GPT-4, reaches 58% accuracy on QRData, leaving substantial room for improvement. Open-source models trail further behind: Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, achieves the highest open-source accuracy at 37%. The primary difficulties lie in conducting data analysis and performing causal reasoning, which suggests that current training regimens are inadequate for these more sophisticated reasoning tasks.
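
The reported accuracies presumably come from matching model answers against gold answers. The snippet below shows one plausible matching rule for numeric and textual answers; the 3% relative tolerance and the normalized exact-match fallback are assumptions, not the paper's exact criterion.

```python
# Hedged sketch of an answer-matching rule for a QRData-style evaluation.
# The 3% relative tolerance and case-insensitive exact match are assumed,
# not taken from the paper.
def is_correct(pred: str, gold: str, rel_tol: float = 0.03) -> bool:
    try:
        p, g = float(pred), float(gold)
        # Numerical answers: accept a small relative (or absolute, near zero) error.
        return abs(p - g) <= max(rel_tol * abs(g), 1e-6)
    except ValueError:
        # Multiple-choice or textual answers: normalized exact match.
        return pred.strip().lower() == gold.strip().lower()


def accuracy(preds: list[str], golds: list[str]) -> float:
    return sum(is_correct(p, g) for p, g in zip(preds, golds)) / len(golds)
```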

Difficulties in Data-Based Reasoning

Most of the evaluated models perform better on the text-only QRText benchmark than on QRData, indicating that data analysis itself presents a significant challenge. There is also a notable disparity between performance on statistical and on causal questions. This suggests that while LLMs may have acquired some statistical reasoning from their training corpora, their causal reasoning abilities remain notably deficient. Even GPT-4, despite its vast pretraining data, struggles to integrate causal knowledge with the provided data, often relying on correlational rather than causal evidence.
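
The kind of reasoning the causal questions demand can be illustrated with a tiny simulation (purely illustrative, using synthetic data rather than a QRData data sheet): a naive correlation between a treatment and an outcome can be large solely because of a shared confounder, and only an adjusted analysis reveals that the causal effect is zero.

```python
# Illustrative simulation (not QRData data): a confounder Z drives both a
# "treatment" T and an outcome Y, so T and Y correlate strongly even though
# T has no causal effect on Y; adjusting for Z recovers an effect near zero.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)               # confounder
t = 0.8 * z + rng.normal(size=n)     # treatment, driven by Z, no effect on Y
y = 1.5 * z + rng.normal(size=n)     # outcome, driven only by Z

print("naive corr(T, Y):", round(float(np.corrcoef(t, y)[0, 1]), 3))  # ~0.5

# Regress Y on [1, T, Z]; the coefficient on T is the Z-adjusted effect.
X = np.column_stack([np.ones(n), t, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Z-adjusted effect of T on Y:", round(float(coef[1]), 3))       # ~0.0
```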

Implications and Future Directions

The findings carry several implications for future work. Practically, improving the ability of LLMs to understand and analyze real-world data accurately could transform fields such as data science, econometrics, and healthcare analytics. Theoretically, the research underscores the need for specialized training strategies that prioritize causal learning and advanced data reasoning. Enhancing model architectures and integrating more capable data analysis tooling could guide future AI systems in areas requiring causal inference and complex statistical analysis.

Final Observations

The gap highlighted by this research serves as a call to action for the AI community to refine LLM capabilities beyond language manipulation to truly intelligent quantitative reasoning. Closing this gap will involve persistent efforts in model architecture refinements, training data enhancements, and method innovations. As LLMs continue to evolve, their integration into applications requiring deep reasoning with real-world data will bring new opportunities and challenges. The QRData benchmark is poised to play a crucial role in pushing this boundary, making it an essential tool for the next generation of AI researchers focused on reasoning and data comprehension.
