~ similar to 2605.30907· 20 results
Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao +6 more
The paper introduces FinBoardBench, a novel evaluation suite using financial board games to demonstrate that current LLMs, despite strong static reasoning, fail at complex, dynamic wealth management a…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan +6 more
The paper introduces KTD-Fin, a novel benchmark that evaluates LLM trading agents by masking historical market data and decomposing returns, finding that LLM agents' profits are largely due to passive…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita +3 more
The paper proposes the Interaction-Native Knowledge Harness (InKH), an architecture that absorbs complex context into financial LLM agents, significantly improving performance, reducing latency, and e…
Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
This paper empirically demonstrates that the choice of plan representation (e.g., checklist vs. narrative) significantly impacts the robustness and success rate of LLM-based web agents.
The paper introduces FinVerBench, a comprehensive benchmark for financial statement verification, concluding that successful verification requires calibrated judgment under realistic observational con…
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
The paper introduces PortBench, a comprehensive benchmark that evaluates LLMs for portfolio management by assessing both correlation awareness and performance across a full, multi-stage decision pipel…
Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu +5 more
The paper introduces SkillBrew, a multi-objective framework that treats skill bank curation as a constrained optimization problem to build efficient and well-curated skill repositories for LLM agents.
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Aaron Chan, Tengfei Li, Tianyi Xiao, Angela Chen +2 more
The paper introduces LATTICE, a novel benchmark for evaluating how well crypto agents assist user decision-making, finding that different agents excel in different specific areas rather than having a…
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
FundaPod is a multi-persona agent platform designed for fundamental investment research, enabling AI agents with distinct viewpoints to independently gather evidence and surface disagreements for huma…
The paper introduces a novel, training-free method to automatically generate fine-grained evaluation rubrics for LLM-as-a-Judge, and further proposes an iterative fine-tuning strategy that significant…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…