
Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE (2506.17330v1)

Published 19 Jun 2025 in cs.SE

Abstract: LLMs have demonstrated significant capabilities across various domains; however, their effectiveness in spreadsheet-related tasks remains underexplored. This study introduces a foundation for a comprehensive benchmark framework to evaluate the performance of leading LLMs in spreadsheet function execution, formula generation, and data manipulation tasks. The benchmark encompasses tasks ranging from basic formula creation to complex, real-world spreadsheet scenarios. Our findings reveal that while LLMs exhibit proficiency in straightforward tasks, they often falter in complex, multi-step operations, frequently producing plausible yet incorrect outputs. These results underscore the limitations of current LLMs in handling spreadsheet tasks that require precise logical reasoning and highlight the need for integrating symbolic reasoning capabilities into LLM architectures. To support this, we introduce FLARE (Formula Logic, Auditing, Reasoning and Evaluation), a new benchmark for evaluating LLM performance on real-world spreadsheet logic, auditing, and reasoning tasks.

Authors (1)
  1. Simon Thorne
