"benchmark" | ArxivCSExplorer

20 results for “benchmark”

CS papers only

Hybrid search: keyword + semantic, ranked by combined score.

cs.SE | Empirical | Recent | Jul 8, 2026

Rethinking Code Performance Benchmarks for LLMs

Nhat Minh Le, Yisen Xu, Zhijie Wang, Tse-Hsun +1 more

This paper evaluates the performance of large language models on popular benchmarks and finds that only a small percentage of the performant implementations are significantly faster than canonical sol…

cs.CL | Recent | May 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

cs.PL | cs.MS | cs.SE | Empirical | Recent | Jul 28, 2026

Progress in Benchmarking Generics for Mathematical Computation

Daniel Pang, Stephen M. Watt

This paper reports on SciGMark 1.5, a benchmark study of specialized and generic implementations in modern languages, examining the consequences of various generic-realization strategies and extending…

cs.AI | Recent | May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

cs.DB | cs.AI | cs.CL | Empirical | Recent | Jul 24, 2026

DBA-Bench: A Production-Fidelity Benchmark for LLM-Based Database Operations Agents

Junming Chen, Junyang Jiang, Xu Chen, Zibo Liang +1 more

The paper introduces DBA-Bench, a benchmark for evaluating database agents with production fidelity, outcome-first evaluation, and controlled scenario reproducibility.

cs.SE | cs.AI | Recent | May 28, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Vedant Padwal

The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…

cs.CR | cs.AI | Recent | May 21, 2026

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi

This paper identifies three core weaknesses—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—that undermine current AI agent security evaluations and proposes directions for buil…

cs.AI | Empirical | Recent | Jul 26, 2026

E-Bench: Benchmarking Multi-Step Tool-Use Agents in Real-World Product Scenarios

Weihuang Zheng, Tianyuan Zou, Eileen Ye, Alphet Liu +4 more

The paper introduces E-Bench, a synthetic benchmark for evaluating multi-step tool use in Large Language Models across three product domains.

cs.AI | Recent | May 28, 2026

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

Silu Panda

The paper introduces FinVerBench, a comprehensive benchmark for financial statement verification, concluding that successful verification requires calibrated judgment under realistic observational con…

cs.CL | cs.SE | Empirical | Recent | Jun 22, 2026

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian +4 more

The paper introduces EnterpriseClawBench, an enterprise agent benchmark comprising 852 tasks and an evaluation protocol; the best configuration scores 0.663.

cs.AI | q-fin.PM | Recent | May 27, 2026

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Yuxuan Zhao, Sijia Chen, Ningxin Su

The paper introduces PortBench, a comprehensive benchmark that evaluates LLMs for portfolio management by assessing both correlation awareness and performance across a full, multi-stage decision pipel…

cs.CL | cs.AI | cs.HC | NEW | Empirical | Jul 29, 2026

APEX-Accounting

Julien Benchek, Austin Bennett, Jasmin Kern, Ryan Stevens +7 more

The paper introduces the APEX-Accounting benchmark to assess how well frontier models perform accounting tasks. Claude-Fable-5 (Max) outperforms the other models with a Mean Criteria@3 of 56.4%.

cs.AI | Recent | Jun 1, 2026

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang +7 more

The paper introduces WorldCoder-Bench, a comprehensive benchmark and evaluation protocol for testing LLMs' ability to autonomously generate complex, physically grounded, and interactive 3D web worlds.

cs.CL | cs.AI | Empirical | Recent | Jul 7, 2026

Data Analysis in the Wild: Benchmarking Large Language Models Against Real-World Data Complexities

So Hasegawa, Shailaja Keyur Sampat, Lei Liu, Wei-Peng Chen

The paper introduces DataGovBench, a benchmark for evaluating Large Language Models in real-world data analysis scenarios, revealing significant performance gaps with state-of-the-art models.

cs.AI | Recent | May 27, 2026

Verifiable Benchmarking of Long-Horizon Spatial Biology

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

The paper introduces SpatialBench-Long, a comprehensive benchmark designed to test AI agents' ability to perform end-to-end scientific reasoning and derive biological claims from complex, raw spatial…

cs.CL | cs.AI | Recent | May 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

cs.AI | cs.CR | Recent | May 12, 2026

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung +2 more

The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to signifi…

cs.AI | Recent | May 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
