Papers similar to 2605.29786

Similar to 2605.29786 · 20 results

cs.AI, cs.LG, cs.SE · Recent · May 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

cs.AI · Recent · May 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

cs.AI, cs.LG · Recent · May 30, 2026

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge +7 more

MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…

cs.AI · Recent · Jun 1, 2026

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García +4 more

The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…

cs.AI, cs.CL · Recent · May 28, 2026

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou +3 more

The paper introduces GTA, a scalable framework for generating realistic, multi-hop web-agent tasks with dense, executable trajectories, addressing the current lack of process-level supervision in web…

cs.AI · Recent · May 31, 2026

The Case for Model Science: Verify, Explore, Steer, Refine

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more

The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.

cs.CR, cs.AI, cs.MA · Recent · May 1, 2026

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

The paper proposes a trust schema and verification framework to ensure that agent skills, which augment LLMs, are rigorously verified before deployment, thereby making human-in-the-loop oversight scal…

cs.SE, cs.AI · Recent · May 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

cs.SE, cs.CR · Recent · May 10, 2026

Evaluating Tool Cloning in Agentic-AI Ecosystems

Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more

The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…

cs.AI · Recent · May 28, 2026

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more

The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…

cs.IR, cs.AI · Recent · May 27, 2026

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…

cs.SE, cs.AI · Recent · May 28, 2026

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more

The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.

cs.CR, cs.LO, cs.MA · Recent · May 19, 2026

Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

Ravi Kiran Kadaboina

Pramana introduces a standardized, protocol-level wire format for autonomous agent outputs, ensuring that every consequential claim is accompanied by a verifiable artifact that can be re-executed by a…

cs.CR, cs.AI · Recent · Jun 3, 2026

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu +5 more

This survey provides a systematic framework and taxonomy for evidence tracing and execution provenance in LLM agents, addressing the difficulty of verifying and auditing complex agent behaviors.

cs.SE, cs.AI · Recent · May 28, 2026

Inferring Code Correctness from Specification

Florian Tambon, Mike Papadakis

The paper introduces TRAILS, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…

cs.SE, cs.AI · Recent · May 27, 2026

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Swanand Rao

Tool Forge is a validation-carrying toolchain that converts natural language capability intent into governed, sandbox-verified tool artifacts, significantly improving agent efficiency and reliability.

cs.CL, cs.AI, cs.CE · Recent · May 28, 2026

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

Hongran An, Zonglin Yang

MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.

cs.LG, cs.CL · Recent · Jun 2, 2026

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more

The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…

cs.LG, cs.AI, cs.CL · Recent · May 27, 2026

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

The paper introduces LearnWeak, an annotation-free framework that automatically specializes small computer-use agents by identifying and targeting their specific weaknesses using a stronger reference…

cs.AI · Recent · May 31, 2026

"Skill issues": data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
