
Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large Language Models (LLMs) have demonstrated significant capabilities across various
domains; however, their effectiveness in spreadsheet-related tasks remains underexplored. This study
lays the foundation for a comprehensive benchmark framework to evaluate the performance of
leading LLMs in executing spreadsheet functions, generating formulas, and manipulating data.
The benchmark encompasses tasks ranging from basic formula creation to complex, real-world
spreadsheet scenarios. Our findings reveal that while LLMs are proficient in straightforward
tasks, they often falter in complex, multi-step operations, frequently producing plausible yet incorrect
outputs. These results underscore the limitations of current LLMs in handling spreadsheet tasks that
require precise logical reasoning and highlight the need to integrate symbolic reasoning
capabilities into LLM architectures. To support this, we introduce FLARE (Formula Logic, Auditing,
Reasoning and Evaluation), a new benchmark for eval
Original language: English
Title of host publication: Proceedings of the EuSpRIG 2025 Conference "Spreadsheet Productivity & Risks"
Editors: Simon Thorne, Grenville J. Croll
Publisher: European Spreadsheet Risks Interest Group
ISBN (Print): 9781905404605
Publication status: Published - 1 Jul 2025
Event: EuSpRIG 2025 Conference Spreadsheet Productivity & Risks - University of Greenwich, London, United Kingdom
Duration: 3 Jul 2025 → 4 Jul 2025

Conference

Conference: EuSpRIG 2025 Conference Spreadsheet Productivity & Risks
Country/Territory: United Kingdom
City: London
Period: 3/07/25 → 4/07/25
