ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.28371· 20 results

cs.AIcs.LGeess.SPRecentMay 27, 2026

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

The paper introduces Picid, a modular evaluation infrastructure that standardizes and formalizes the entire Prognostics and Health Management (PHM) evaluation pipeline to ensure reproducible and fair…

View →
cs.AIRecentMay 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

View →
cs.AIRecentMay 28, 2026

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong +7 more

The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…

View →
cs.AIRecentJun 1, 2026

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more

The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…

View →
cs.AIRecentMay 27, 2026

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more

The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…

View →
cs.AIRecentJun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

View →
cs.SEcs.AIRecentMay 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

View →
cs.CLcs.AIRecentMay 31, 2026

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more

The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…

View →
cs.AIcs.LGRecentMay 30, 2026

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge +7 more

MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…

View →
cs.SEcs.AIRecentJun 3, 2026

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

Sanderson Oliveira de Macedo

This paper studies AI development frameworks for software engineering and proposes a six-dimension process taxonomy.

View →
cs.SEcs.AIRecentJun 1, 2026

Monitoring Agentic Systems Before They're Reliable

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens +1 more

The paper proposes a comprehensive monitoring and triage methodology for agentic systems, demonstrating that structural defects mask task-level errors and require specialized monitoring scopes for det…

View →
cs.AIRecentMay 28, 2026

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more

The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…

View →
cs.AIRecentMay 27, 2026

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

Sara Metcalf, William Schoenberg

The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…

View →
cs.AIcs.CRRecentApr 1, 2026

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more

The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…

View →
cs.SEcs.CRRecentMay 10, 2026

Evaluating Tool Cloning in Agentic-AI Ecosystems

Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more

The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…

View →
cs.CRRecentMay 3, 2026

AgenticVM: Agentic AI for Adaptive Software Vulnerability Management

Asrul Arifin, Hussain Ahmad, Yiyao Zhang, Diksha Goel

AgenticVM is a multi-agent framework that uses LLMs and specialized tools to automate and drastically reduce the volume of software vulnerabilities into actionable, prioritized queues.

View →
cs.AIcs.CLRecentMay 28, 2026

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…

View →
cs.AIRecentJun 1, 2026

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García +4 more

The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…

View →
cs.CLRecentMay 30, 2026

I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications

Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li +1 more

The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…

View →
cs.AIRecentMay 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

View →