ArXivCSExplorer
Built by Teycir Ben Soltane

~ similar to 2605.28345 · 20 results

cs.AI, cs.LG, cs.SE · Recent · May 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

cs.CL · Recent · May 29, 2026

Wind Turbine Maintenance Log Labelling Framework: LLM-Driven Data Correction and Enrichment via Semantic Extraction of Reliability Intelligence

Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya

The paper introduces an LLM-driven framework to automatically standardize, structure, and enrich unstructured free-text wind turbine maintenance logs, transforming qualitative field observations into…

cs.LG, cs.AI, cs.CE · Recent · May 28, 2026

Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction

Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke +3 more

The paper proposes a multi-task scientific machine learning framework that jointly predicts key engine health indicators (TGTU, DTGT) and the Remaining Useful Life (RUL) while quantifying prediction u…

cs.SE, cs.AI · Recent · Jun 1, 2026

Monitoring Agentic Systems Before They're Reliable

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens +1 more

The paper proposes a comprehensive monitoring and triage methodology for agentic systems, demonstrating that structural defects mask task-level errors and require specialized monitoring scopes for det…

cs.AI · Recent · May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

cs.AI · Recent · May 28, 2026

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong +7 more

The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.AI, cs.CL, cs.ET · Recent · Jun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.

cs.AI · Recent · Jun 1, 2026

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García +4 more

The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…

cs.CR · Recent · Apr 27, 2026

System-aware contextual digital twin for ICS anomaly diagnosis

Eungyu Woo, Yooshin Kim, Wonje Heo, Donghoon Shin

The paper proposes a system-aware unsupervised framework that combines lightweight online detection with a contextual digital twin and LLM to provide interpretable, actionable anomaly diagnoses for In…

cs.AI · Recent · May 31, 2026

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang +3 more

This paper introduces a failure-aware observability framework to diagnose wasted computation in multi-agent LLM systems by mapping recurring failure modes to online trace signals.

cs.AI, eess.SP · Recent · May 27, 2026

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels +1 more

The paper proposes GraD-IBD, a graph-based model that reformulates longitudinal ICD diagnosis codes into temporally directed graphs to efficiently and accurately detect the risk of Inflammatory Bowel…

cs.CL, cs.AI, cs.LG · Recent · May 28, 2026

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong +2 more

The paper reframes LLM reliability from an impossible universal problem to a manageable, local patch-based problem, showing that sufficient interventions can be found by focusing on recurring failure…

cs.CR, cs.LG · Recent · May 29, 2026

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Shuhao Zhang, Jiarui Li, Qi Cao, Ruiyi Zhang +1 more

The paper introduces SCOUT, a dynamic detector allocation framework that improves prompt-injection defense by predicting detector reliability and latency to optimize the trade-off between safety and o…

cs.AI, cs.CR · Recent · Apr 1, 2026

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more

The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…

cs.CR, cs.AI, cs.SE · Recent · Jun 3, 2026

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen +4 more

This paper investigates Description-Code Inconsistency (DCI) in Model Context Protocol (MCP) servers, finding that 9.93% of real-world tools exhibit inconsistencies that create security blind spots.

cs.CL, cs.AI, cs.CV · Recent · Jun 1, 2026

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more

The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…

cs.CL, cs.LG · Recent · May 29, 2026

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain +2 more

The paper formalizes LLM degradation during long generation as 'cognitive fatigue' and introduces the Fatigue Index (FI), a measurable, model-agnostic diagnostic tool for real-time monitoring.

cs.SE, cs.AI · Recent · May 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

cs.DC, cs.AI · Recent · Jun 1, 2026

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

Yafan Huang, Sheng Di, Guanpeng Li

This paper systematically studies how soft errors propagate during Large Language Model (LLM) inference using a novel fault-injection framework, providing critical insights and mitigation strategies f…
