Can AI Refute Economic Theory? Testing Models on Hidden Errors
This presentation examines a groundbreaking experiment testing whether frontier AI models like ChatGPT, Claude, and Gemini can autonomously identify and refute errors in published economic theory. Using four peer-reviewed papers with subtle mathematical flaws, the research reveals that while current models excel at targeted verification and counterexample construction when guided by humans, they cannot yet discover deep theoretical errors independently. The findings have profound implications for the future of peer review in economics.
Script
What if an AI could spot the mathematical errors that human peer reviewers miss in economics journals? This paper tests whether frontier language models can autonomously refute published economic theory by challenging them with four peer-reviewed papers, each containing a subtle error previously corrected by human experts.
The experimental design is elegantly simple. The authors took four economics papers, each containing a mathematical or conceptual flaw that had already been identified and corrected, and uploaded the full texts to ChatGPT, Claude, Gemini, and Refine. To limit data contamination, one test was run immediately after publication with web search disabled, so the model could not simply retrieve the known correction.
ChatGPT Pro emerged as the clear winner: once pointed to the relevant passage, it quickly recognized logical gaps and constructed rigorous counterexamples that were sometimes more elegant than the published corrections. Claude showed strength in economic interpretation but weaker mathematical critique, while Gemini endorsed incorrect arguments and produced hallucinations. None of the models, however, found errors without substantial human guidance pointing them to the problematic sections.
The results reveal a crucial limitation: these models excel at targeted verification once a human flags the suspect region, but they lack the autonomous capability to discover deep theoretical flaws on their own. When directed to specific proof steps, ChatGPT Pro validated or refuted logic with precision and generated candidate counterexamples that matched professional standards.
This limitation exposes a critical vulnerability in current peer review. Economics journals make both kinds of error, accepting flawed papers and rejecting sound ones, because human reviewers often do not check the details of proofs closely enough. While fully autonomous AI critique remains out of reach, integrating frontier models into editorial workflows could dramatically raise standards of rigor, provided journals adapt confidentiality protocols and manage contamination risks.
The paper makes a bold claim: a competent economist working with a frontier model can already outperform traditional refereeing for technical scrutiny. This isn't about replacing human judgment but augmenting it with computational precision where it matters most. To dive deeper into how AI is reshaping research workflows, visit EmergentMind.com and create your own presentation exploring the cutting edge.