- The paper demonstrates that current GPT-based tools can generate ERFR-compliant models for basic tasks but struggle with complex and ambiguous instructions.
- The study employs a systematic evaluation framework using 16 analytic prompts to measure accuracy, formula integrity, and reusability standards.
- The analysis reveals that despite potential for rapid prototyping, expert supervision remains essential due to inconsistent outputs and reproducibility issues.
Introduction
This paper systematically evaluates the current capabilities of GPT-based assistants for generating reusable, analytic spreadsheet models. It assesses the practical utility of these tools against the Essential Requirements for Reusability (ERFR), introduced as the evaluation standard: all inputs explicitly in cells, all calculations via cell formulas, absence of hardwired numbers, proper labeling, and accuracy of computations. Among five high-usage GPT extensions, Excel AI by pulsrai.com (EAI) produces the most consistent and structured outputs and is selected for focused experimental analysis.
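The structural ERFR criteria lend themselves to a mechanical check. The following is a minimal sketch, not the paper's evaluation procedure: it assumes a hypothetical in-memory sheet (a dict from cell address to content), single-letter columns, and labels stored in column A of the same row. The names `sheet`, `hardwired_numbers`, and `audit` are illustrative. Accuracy of computations, the fifth criterion, is outside what this check can establish.

```python
import re

# Hypothetical in-memory sheet: cell address -> raw cell content.
# Strings starting with "=" are formulas; other values are labels or inputs.
sheet = {
    "A1": "Unit price", "B1": 4.5,
    "A2": "Quantity",   "B2": 120,
    "A3": "Revenue",    "B3": "=B1*B2",
    "A4": "Tax",        "B4": "=B3*0.2",  # 0.2 is a hardwired number
}

# Simplified patterns: a cell reference such as B3 or $B$3, and a numeric literal.
# Function names that look like addresses (e.g. LOG10) are not handled.
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+")
NUMBER = re.compile(r"(?<![A-Za-z0-9_.])\d+(?:\.\d+)?")


def is_formula(value):
    return isinstance(value, str) and value.startswith("=")


def hardwired_numbers(formula):
    """Return numeric literals that remain after removing cell references."""
    return NUMBER.findall(CELL_REF.sub("", formula))


def audit(sheet):
    """List (cell, problem) pairs for the structural ERFR checks."""
    findings = []
    for addr, value in sorted(sheet.items()):
        if is_formula(value):
            literals = hardwired_numbers(value)
            if literals:
                findings.append((addr, "hardwired number(s): " + ", ".join(literals)))
        elif isinstance(value, (int, float)):
            # Assumption: the label for an input sits in column A of the same row.
            label = sheet.get("A" + addr[1:])
            if not (isinstance(label, str) and label.strip()):
                findings.append((addr, "input without a label in column A"))
    return findings


findings = audit(sheet)
```

In this example only `B4` is flagged, because the tax rate appears as a literal inside the formula instead of in its own labeled input cell.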
Methodology
The tool selection process involves screening prominent GPT-based extensions for spreadsheet modeling, emphasizing high adoption, reliability, spreadsheet generation capabilities, and integration with code-interpreter functionalities. The evaluation leverages sixteen simple analytic prompts to probe conversational quality, user interaction, error handling, file reliability, and spreadsheet design. A consistent protocol is used: prompt submission, monitoring responses, and validating the operational and structural integrity of the generated Excel files.
Prompting Strategy
A bifurcated prompting approach separates the problem statement (the computation logic or business question) from explicit instructions specifying deliverables (Excel model, cell formulas, downloadable file). Three minimal yet effective instruction statements are adopted to achieve ERFR-compliant outputs.
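The two-part structure can be sketched as a small prompt builder. The instruction statements below are illustrative placeholders based on the deliverables named above (Excel model, cell formulas, downloadable file); they are not the paper's exact wording, and `build_prompt` is a hypothetical helper.

```python
# Illustrative instruction statements; the paper's exact wording is not reproduced here.
INSTRUCTIONS = (
    "Create an Excel model for this problem.",
    "Use cell formulas for every calculation.",
    "Provide the model as a downloadable Excel file.",
)


def build_prompt(problem_statement, instructions=INSTRUCTIONS):
    """Join the problem statement and the deliverable instructions as two separate parts."""
    problem = problem_statement.strip()
    if not problem:
        raise ValueError("problem statement must not be empty")
    return "\n\n".join([problem, "\n".join(instructions)])


prompt = build_prompt(
    "A shop sells 120 widgets a month at $4.50 each. What is the monthly revenue?"
)
```

Keeping the instructions in one constant means the same deliverable requirements can be reused unchanged across every problem statement in a test set.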
Results
EAI reliably produces models meeting ERFR for straightforward data-driven and parameterized tasks, correctly handling both concrete variable values and symbolic parameters. It also demonstrates robustness to language ambiguity (e.g., "month" vs. "30 days") and exhibits no difficulty with novel lexical constructs (e.g., "snapplees" as a stand-in for item types).
However, failures occur on tasks involving implicit time ranges (e.g., modeling a "month" as a flexible interval), where EAI sometimes returns columns of static numbers rather than cell formulas. This breaks reusability, because changes to the inputs cannot propagate through formulas, which is a core ERFR violation. Additionally, EAI occasionally produces "garbage" formulas, and its output varies from run to run, underscoring ongoing reproducibility limitations.
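Both failure modes, formulas replaced by static numbers and run-to-run variability, can be surfaced by comparing the formula cells of two runs of the same prompt. This is a minimal sketch under the assumption that each generated file has already been loaded into a dict of cell address to content; `run_a`, `run_b`, and `compare_runs` are hypothetical names.

```python
def formula_cells(sheet):
    """Keep only the cells whose content is a formula (a string starting with '=')."""
    return {a: v for a, v in sheet.items() if isinstance(v, str) and v.startswith("=")}


def compare_runs(run_a, run_b):
    """Report formula cells that are missing from, or differ between, two runs."""
    fa, fb = formula_cells(run_a), formula_cells(run_b)
    return {
        "only_in_a": sorted(fa.keys() - fb.keys()),
        "only_in_b": sorted(fb.keys() - fa.keys()),
        "different": sorted(k for k in fa.keys() & fb.keys() if fa[k] != fb[k]),
    }


# Two hypothetical outputs for the same prompt.
run_a = {"B1": 4.5, "B2": 120, "B5": 0.2, "B3": "=B1*B2", "B4": "=B3*B5"}
run_b = {"B1": 4.5, "B2": 120, "B5": 0.2, "B3": 540.0, "B4": "=B3+B5"}

report = compare_runs(run_a, run_b)
```

Here `B3` appears under `only_in_a` because the second run returned a static number in place of the formula, and `B4` appears under `different` because the two runs disagree on the formula itself.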
Large Task Analysis: The Wall Task
When prompt complexity increases, as in the Wall Task (a multi-parameter, paragraph-length business scenario), further weaknesses appear. Initial responses often lack formulas entirely; subsequent interaction (affirmative selection for a "dynamic calculator") can yield near industrial-quality modules, but alternate runs produce Excel files with critical formulaic omissions or corrupt structures. The need for active user intervention to reach a valid ERFR output underscores the tool’s unreliability for unattended industrial usage.
Emergent Issues: Confidence, Workflow, and Usability
The Problem of Confidence
The central challenge is establishing trust in outputs generated by opaque, stochastic agents. While trivial ERFR aspects (inputs in cells, absence of hardwired values) are easily validated, the correctness of more intricate computational logic (especially in new, non-benchmark models) remains an open problem. Neither informal review nor automated audit tools fully address this; manual inspection or cross-validation with known-good models is still required.
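Cross-validation against a known-good model can be sketched as follows. The sketch assumes the generated spreadsheet can be recalculated for new inputs and read back programmatically, which is represented here by a plain Python callable; `reference_revenue`, `faulty_candidate`, and `cross_validate` are hypothetical stand-ins.

```python
import math


def reference_revenue(price, quantity):
    """Known-good model used as the comparison baseline."""
    return price * quantity


def faulty_candidate(price, quantity):
    """Stand-in for a generated model that hardwired the quantity as 120."""
    return price * 120


def cross_validate(candidate, reference, cases, rel_tol=1e-9):
    """Return (inputs, expected, actual) for every case where the models disagree."""
    mismatches = []
    for case in cases:
        expected = reference(**case)
        actual = candidate(**case)
        if not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatches.append((case, expected, actual))
    return mismatches


cases = [
    {"price": 4.5, "quantity": 120},  # the values given in the original prompt
    {"price": 4.5, "quantity": 200},  # changed input to test reusability
]
mismatches = cross_validate(faulty_candidate, reference_revenue, cases)
```

The faulty model agrees with the reference on the original inputs and diverges only when the quantity changes, which is why the test cases need to vary the inputs rather than repeat the values from the prompt.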
The Problem of Workflow
An alternative source of value lies in GPT-generated drafts: even imperfect models could reduce total development time or lower the required user expertise, so long as the workflow supports efficient error correction or reconstruction. The economic justification hinges on whether prompt iteration and post-processing together require less expertise and time than building the model manually from scratch. However, for high-stakes or large-scale financial and analytic tasks, the unreliability and the need for expert validation undermine the case for deployment.
Implications and Future Directions
The experimental evidence suggests current GPT-based tools are not viable for professional, unsupervised spreadsheet model construction. Their outputs are inconsistent, frequently violate key reusability standards, and cannot be assured of accuracy without expert intervention. They may offer incremental value as accelerators for model prototyping in low-risk domains or educational settings, but not as automators of “industrial quality” analytic spreadsheets.
Key research challenges remain:
- Rigorously developing prompt engineering strategies tailored to complex spreadsheet tasks.
- Systematic benchmarking on larger, real-world business modeling corpora.
- Investigating deterministic or controlled-output variants to mitigate non-reproducibility.
- Integrating intelligent post-generation auditing to increase user confidence.
Conclusion
GPT-based spreadsheet generators like Excel AI by pulsrai.com display intermittent success in modeling simple analytic scenarios but systematically fail ERFR when model logic, complexity, or instruction ambiguity increases. Continued advancements in prompt engineering, output control, and integration with spreadsheet auditing frameworks are prerequisites for professional reliability. For now, expert supervision remains necessary, and the aspiration of fully automated, reusable, and accurate model generation is not realized. Future research should prioritize prompt robustness, reproducibility, and scalable automated validation to unlock practical impact in analytic workflow automation.