~ similar to 2605.31278· 20 results
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu +8 more
This survey provides a comprehensive, practical guide to ensuring the trustworthiness of complex, autonomous agentic AI systems by focusing on safety, robustness, privacy, and system security.
Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more
The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.
This paper introduces cost-aware Retrieval-Augmented Generation (RAG), demonstrating that fixed evidence selection is brittle and that adaptive, agentic controllers are necessary for effective knowled…
Sen Fang, Weiyuan Ding, Zhezhen Cao, Zhou Yang +1 more
AEGIS is a novel multi-agent framework that grounds vulnerability reasoning by reconstructing per-variable dependency chains over a Code Property Graph, achieving state-of-the-art performance on the P…
Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai +1 more
HypoAgent is an agentic framework that enables interactive, multi-turn abductive hypothesis generation over knowledge graphs, achieving state-of-the-art performance by integrating specialized agents f…
AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.
The paper introduces Entropy-Cut Metropolis-Hastings, an efficient sampling method that uses next-token entropy to identify and resample from critical decision points in a reasoning trace, significant…
Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim +2 more
The paper introduces GAIATrace, a comprehensive token-level dataset, and Vidur-Agent, a simulator, to enable reproducible and detailed system-level characterization of complex multi-model agentic AI s…
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more
This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…
Pramana introduces a standardized, protocol-level wire format for autonomous agent outputs, ensuring that every consequential claim is accompanied by a verifiable artifact that can be re-executed by a…
The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…
The paper evaluates multi-agent LLM oracle systems for prediction market resolution, finding that independent aggregation with confidence-weighted voting significantly outperforms single-model baselin…
The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…
ResearchLoop introduces an evidence-gated control plane to manage and audit the state of AI-assisted computational research, mitigating the risk of unverified claims.
Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho +1 more
The paper proposes Faithful Agentic XAI (FAX), a verification framework that explicitly checks LLM-generated explanations against model behavior, significantly improving explanation faithfulness on a…