~ similar to 2605.28371· 20 results
The paper introduces Picid, a modular evaluation infrastructure that standardizes and formalizes the entire Prognostics and Health Management (PHM) evaluation pipeline to ensure reproducible and fair…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more
The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
This paper studies AI development frameworks for software engineering and proposes a six-dimension process taxonomy.
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens +1 more
The paper proposes a comprehensive monitoring and triage methodology for agentic systems, demonstrating that structural defects mask task-level errors and require specialized monitoring scopes for det…
Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more
The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…
The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more
The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…
AgenticVM is a multi-agent framework that uses LLMs and specialized tools to automate and drastically reduce the volume of software vulnerabilities into actionable, prioritized queues.
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…