Similar to 2605.28345 · 20 results
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
The paper introduces an LLM-driven framework to automatically standardize, structure, and enrich unstructured free-text wind turbine maintenance logs, transforming qualitative field observations into…
Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke +3 more
The paper proposes a multi-task scientific machine learning framework that jointly predicts key engine health indicators (TGTU, DTGT) and the Remaining Useful Life (RUL) while quantifying prediction u…
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens +1 more
The paper proposes a comprehensive monitoring and triage methodology for agentic systems, demonstrating that structural defects mask task-level errors and require specialized monitoring scopes for det…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
The paper proposes a system-aware unsupervised framework that combines lightweight online detection with a contextual digital twin and LLM to provide interpretable, actionable anomaly diagnoses for In…
Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more
This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.
The paper proposes GraD-IBD, a graph-based model that reformulates longitudinal ICD diagnosis codes into temporally directed graphs to efficiently and accurately detect the risk of Inflammatory Bowel…
Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong +2 more
The paper reframes LLM reliability from an impossible universal problem to a manageable, local patch-based problem, showing that sufficient interventions can be found by focusing on recurring failure…
Shuhao Zhang, Jiarui Li, Qi Cao, Ruiyi Zhang +1 more
The paper introduces SCOUT, a dynamic detector allocation framework that improves prompt-injection defense by predicting detector reliability and latency to optimize the trade-off between safety and o…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen +4 more
This paper investigates Description-Code Inconsistency (DCI) in Model Context Protocol (MCP) servers, finding that 9.93% of real-world tools exhibit inconsistencies that create security blind spots.
Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more
The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…
Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain +2 more
The paper formalizes LLM degradation during long generation as 'cognitive fatigue' and introduces the Fatigue Index (FI), a measurable, model-agnostic diagnostic tool for real-time monitoring.
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
This paper systematically studies how soft errors propagate during Large Language Model (LLM) inference using a novel fault-injection framework, providing critical insights and mitigation strategies f…