The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on average.
We present BlueFin, a benchmark that evaluates large language model (LLM) agents on synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions -- an order of magnitude more than the estimated global population of professional developers -- comparatively few resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria. Notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($\alpha=0.826$) with a macro-F1 score of 0.839. Frontier LLMs perform poorly on the benchmark, with the strongest LLMs achieving average scores below 50\% across tasks; models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open-source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.
Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents
The paper proposes the Interaction-Native Knowledge Harness (InKH), an architect…
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Eva…
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchm…
Global Policy-Space Response Oracles for Two-Player Zero-Sum Games
The paper introduces Global PSRO, a novel deep reinforcement learning framework…
Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
The paper introduces DOMINO, a novel inductive framework that synthesizes domain…
FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via…
The paper introduces FinBoardBench, a novel evaluation suite using financial boa…
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
The paper introduces LearnWeak, an annotation-free framework that automatically…
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-…
PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say
The paper introduces PrivacyPeek, a new benchmark that audits the acquisition st…