~ similar to 2605.29586· 20 results
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao +6 more
The paper introduces FinBoardBench, a novel evaluation suite using financial board games to demonstrate that current LLMs, despite strong static reasoning, fail at complex, dynamic wealth management a…
The paper introduces PortBench, a comprehensive benchmark that evaluates LLMs for portfolio management by assessing both correlation awareness and performance across a full, multi-stage decision pipel…
The paper proposes a federated formal verification architecture that treats verification as a polyglot proof system, successfully validating it on complex production subsystems like a Raft consensus m…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…
Hao Chen, Xing Tang, Qirui Liu, Weijie Shi +5 more
The paper introduces the Data-centric Reasoning Compiler (DCRC), a novel data-driven framework that enhances financial QA systems by compiling user queries and retrieved documents into verifiable, exe…
The paper proposes a trust-boundary architecture using Lean 4 to verify the deterministic structured computations surrounding LLM pipelines, providing verifiable certificates for high-stakes deploymen…
Qingwen Zeng, Zhenghao Zhao, Yitian Yang, Yiqi Zhu +5 more
This paper proposes a unified, lifecycle-centric framework and a detailed taxonomy to survey and analyze novel, finance-specific attack surfaces and vulnerabilities in AI systems used within the finan…
The paper introduces the Lean-Agent Protocol, a formal verification platform that uses Lean 4 theorem proving to ensure agentic AI actions in finance are mathematically compliant with complex regulati…
The paper proposes FinSec, a novel four-tier security detection framework, to robustly identify complex financial risks and suspicious dialogue patterns in LLM-powered financial agents, achieving stat…
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…
This paper benchmarks LLMs for smart contract security analysis, concluding that while LLMs show potential, their reliability is limited by lexical bias and requires integration with traditional stati…
Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
QCIVET introduces a novel contract-based framework to ensure the integrity of hybrid quantum-classical pipelines by verifying both the structure (syntactic) and the behavior (semantic) of quantum stag…
This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…