ArXivCSExplorer

~ similar to 2605.28032 · 19 results

stat.OT · cs.AI · Empirical · Recent · Jun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

cs.CL · Recent · Jun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

cs.CL · Recent · May 31, 2026

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…

cs.CR · cs.AI · Recent · Apr 22, 2026

CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge

Gustav Keppler, Ghada Elbez, Veit Hagenmeyer

The paper introduces CyberCertBench, a new benchmark suite for evaluating LLMs against industry cybersecurity certifications, finding that while frontier models perform well on general knowledge, thei…

cs.CL · cs.AI · Recent · May 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

cs.AI · cs.CL · Recent · May 28, 2026

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more

This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…

cs.CL · cs.AI · Recent · May 27, 2026

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

Volodymyr Ovcharov

The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…

cs.CL · cs.AI · Recent · May 27, 2026

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…

cs.AI · Recent · May 27, 2026

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more

The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.

cs.AI · Recent · May 28, 2026

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…

cs.CL · cs.AI · Recent · May 27, 2026

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak +5 more

The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…

cs.CL · cs.CR · Recent · May 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

cs.CR · cs.LG · cs.SE · Recent · Apr 30, 2026

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin

The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…

cs.SE · cs.AI · cs.CL · Recent · May 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

cs.CL · cs.AI · cs.LG · Recent · Jun 1, 2026

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…

cs.AI · Recent · May 28, 2026

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more

The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…

cs.AI · Recent · May 28, 2026

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

Silu Panda

The paper introduces FinVerBench, a comprehensive benchmark for financial statement verification, concluding that successful verification requires calibrated judgment under realistic observational con…

cs.AI · Recent · May 28, 2026

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

Jiahao Huang, Fei Cheng, Junfeng Jiang, Akiko Aizawa

This paper introduces the Data-Model Compatibility (DMC) metric to quantify how suitable a dataset is for reasoning distillation, showing that optimizing data selection using DMC significantly improve…
