"Benchmarking" | ArxivCSExplorer

20 results for “Benchmarking”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.

stat.OT | cs.AI | Empirical | Recent | Jun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

cs.SE | Empirical | Recent | Jul 8, 2026

Rethinking Code Performance Benchmarks for LLMs

Nhat Minh Le, Yisen Xu, Zhijie Wang, Tse-Hsun +1 more

This paper evaluates the performance of large language models on popular benchmarks and finds that only a small percentage of the performant implementations are significantly faster than canonical sol…

cs.AI | Recent | May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

cs.CR | cs.AI | Recent | May 21, 2026

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi

This paper identifies three core weaknesses—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—that undermine current AI agent security evaluations and proposes directions for buil…

cs.SE | cs.AI | Recent | May 28, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Vedant Padwal

The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…

cs.AI | Empirical | Recent | Jul 26, 2026

E-Bench: Benchmarking Multi-Step Tool-Use Agents in Real-World Product Scenarios

Weihuang Zheng, Tianyuan Zou, Eileen Ye, Alphet Liu +4 more

The paper introduces E-Bench, a synthetic benchmark for evaluating multi-step tool use in Large Language Models across three product domains.

cs.DB | cs.AI | cs.CL | Empirical | Recent | Jul 24, 2026

DBA-Bench: A Production-Fidelity Benchmark for LLM-Based Database Operations Agents

Junming Chen, Junyang Jiang, Xu Chen, Zibo Liang +1 more

The paper introduces DBA-Bench, a benchmark for evaluating database agents with production fidelity, outcome-first evaluation, and controlled scenario reproducibility.

cs.CL | Recent | May 31, 2026

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…

cs.AI | Recent | May 28, 2026

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

Silu Panda

The paper introduces FinVerBench, a comprehensive benchmark for financial statement verification, concluding that successful verification requires calibrated judgment under realistic observational con…

cs.AI | Recent | Jun 1, 2026

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang +7 more

The paper introduces WorldCoder-Bench, a comprehensive benchmark and evaluation protocol for testing LLMs' ability to autonomously generate complex, physically grounded, and interactive 3D web worlds.

cs.CL | cs.SE | Empirical | Recent | Jun 22, 2026

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian +4 more

The paper introduces EnterpriseClawBench, an enterprise agent benchmark with 852 tasks and evaluation protocol, achieving a best configuration score of 0.663.

cs.CL | cs.AI | Empirical | Recent | Jul 7, 2026

Data Analysis in the Wild: Benchmarking Large Language Models Against Real-World Data Complexities

So Hasegawa, Shailaja Keyur Sampat, Lei Liu, Wei-Peng Chen

The paper introduces DataGovBench, a benchmark for evaluating Large Language Models in real-world data analysis scenarios, revealing significant performance gaps with state-of-the-art models.

cs.AI | Recent | May 27, 2026

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more

The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…

cs.AI | q-fin.PM | Recent | May 27, 2026

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Yuxuan Zhao, Sijia Chen, Ningxin Su

The paper introduces PortBench, a comprehensive benchmark that evaluates LLMs for portfolio management by assessing both correlation awareness and performance across a full, multi-stage decision pipel…

cs.AI | cs.CR | Recent | May 12, 2026

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung +2 more

The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to signifi…

cs.CL | cs.AI | Recent | May 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

eess.AS | Empirical | Recent | Jun 27, 2026

GigaSpeechBench: A Real-World Multilingual Speech-to-Text Benchmark

Yujie Tu, Yifan Yang, Tianrui Wang, Yanqiao Zhu +32 more

The paper introduces GigaSpeechBench, a comprehensive multilingual and multidimensional ASR & AST benchmark with 680 hours of human-annotated speech, featuring 12 low-resource languages, 6 Chinese dia…

cs.AI | Recent | May 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

cs.AI | Recent | Jun 1, 2026

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu +11 more

The paper introduces SMH-Bench, a comprehensive benchmark built on a simulator to rigorously test LLM agents' ability to perform complex, environment-grounded reasoning and actions in realistic smart-…
