20 results for “Benchmarking”
CS papers only. Hybrid search: keyword + semantic, ranked by combined score.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper introduces robustness indicators to systematically analyze how multilingual text embedding model rankings change based on dataset composition and aggregation methods, revealing that only a…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
This paper identifies three core weaknesses—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—that undermine current AI agent security evaluations and proposes directions for buil…
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…
The paper introduces FinVerBench, a comprehensive benchmark for financial statement verification, concluding that successful verification requires calibrated judgment under realistic observational con…
Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang +7 more
The paper introduces WorldCoder-Bench, a comprehensive benchmark and evaluation protocol for testing LLMs' ability to autonomously generate complex, physically grounded, and interactive 3D web worlds.
Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more
The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…
The paper introduces PortBench, a comprehensive benchmark that evaluates LLMs for portfolio management by assessing both correlation awareness and performance across a full, multi-stage decision pipel…
Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung +2 more
The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to signifi…
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Ojas Nimase, Zhe Chen, Gengpei Qi, Yue Zhao +1 more
The paper introduces GEO-Bench, a unified benchmark that standardizes the evaluation of various generative engine optimization (GEO) ranking manipulation attacks, demonstrating that black-box content…
Ojas Nimase, Zhe Chen, Gengpei Qi, Yue Zhao +1 more
GEO-Bench introduces a standardized benchmark to compare various ranking manipulation attacks (both black-box and white-box) on generative engines, demonstrating that black-box content rewriting can b…
Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu +11 more
The paper introduces SMH-Bench, a comprehensive benchmark built on a simulator to rigorously test LLM agents' ability to perform complex, environment-grounded reasoning and actions in realistic smart-…
The paper introduces RealVuln, a benchmark that demonstrates a clear three-tier performance hierarchy for security scanners on real-world code, with specialized tools significantly outperforming gener…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
This paper systematically evaluates the consistency of popular causal discovery benchmarks against real-world scientific literature, revealing significant variability in their accuracy.
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…