Similar to 2605.28916 · 20 results
This case study demonstrates that in complex scientific software development, human domain expertise and careful supervision are more critical to ensuring the trustworthiness of AI-generated code than…
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
Zimo Ji, Zongjie Li, Wenyuan Jiang, Yudong Gao +1 more
The paper independently stress-tests Claude Code's auto mode permission system using a deliberately ambiguous benchmark, finding that its actual false-negative rate is significantly higher than reported…
VESTA introduces a novel agent framework that enhances Visual Language Models (VLMs) by equipping them with a dynamic, reusable toolkit of diagnostic and statistical tools, significantly improving aut…
The paper introduces Hyperparam, a set of lightweight JavaScript libraries designed to enable direct, model-aware querying of unstructured data (like agent traces) within client-side AI applications.
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
This paper proposes a two-stage method to improve the efficiency and robustness of the Locally Aligned Ant Technique (LAAT) for detecting cosmic structures in noisy, high-dimensional point clouds.
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.
Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan +15 more
AutoSci is a memory-centric agentic system designed to automate the entire scientific research lifecycle by integrating structured memory, multi-stage execution, and continuous self-improvement.
The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…
Jiazhen Lei, Tianze Cao, Yuxin Sha, Sihan Wang +4 more
The paper introduces RadioMaster, a novel multi-agent system that successfully translates high-level user intents into physically viable, real-world radio signals, significantly outperforming existing…
Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding +3 more
The paper introduces LongDS, a new benchmark for long-horizon, multi-turn data analysis, demonstrating that current AI agents struggle significantly with maintaining and updating complex analytical st…
Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma +9 more
The paper introduces Science Earth, a planet-scale scientific runtime that enables diverse, siloed AI capabilities to connect and collaborate dynamically, demonstrating that scientific discovery can b…
Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim +2 more
The paper introduces GAIATrace, a comprehensive token-level dataset, and Vidur-Agent, a simulator, to enable reproducible and detailed system-level characterization of complex multi-model agentic AI s…
The paper proposes an empowerment-guided multi-agent system that uses semantic checkpoints and structured communication to ensure that complex scientific computing workflows maintain semantic consiste…
Aaron Chan, Tengfei Li, Tianyi Xiao, Angela Chen +2 more
The paper introduces LATTICE, a novel benchmark for evaluating how well crypto agents assist user decision-making, finding that different agents excel in different specific areas rather than having a…
AgenticVM is a multi-agent framework that uses LLMs and specialized tools to automatically and drastically reduce the volume of software vulnerabilities, turning them into actionable, prioritized queues.
AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.
The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.