Similar to 2605.28978 · 20 results
The paper introduces AbaqusAgent, a multi-AI-agent framework that uses large language models to translate natural language instructions into executable Finite Element Analysis (FEA) simulations using…
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…
The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
The paper proposes MITL, an MsFEM-inspired transfer learning strategy for CNN-based reduced-order models, enabling efficient and adaptable approximation of multiscale systems with minimal retraining.
AgenticVM is a multi-agent framework that uses LLMs and specialized tools to automatically and drastically reduce the volume of software vulnerabilities, turning them into actionable, prioritized queues.
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc +1 more
The paper introduces NewtPhys, a novel 4D dataset of real-world scenes with dense physical annotations, to systematically evaluate and reveal the limitations of foundation models in low-level Newtonia…
The paper proposes a novel multimodal multi-agent framework that uses a topological knowledge graph to enable robust, adaptive automatic workflow execution, overcoming the limitations of treating task…
Ruiyin Li, Yiran Zhang, Xiyu Zhou, Yangxiao Cai +5 more
The paper introduces MAAD, a multi-agent framework that autonomously transforms software requirements into comprehensive, multi-view architectural blueprints, significantly improving completeness and…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat +4 more
The paper introduces AutoformBot, a multi-agent system that successfully autoformalizes a large corpus of open-access graduate-level mathematics textbooks into a verified library in Lean 4, demonstrat…
Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo +5 more
The paper proposes a modular agent framework and novel learning methods to design and optimize practical, cost-effective, and controllable LLM-based agentic systems.
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens +1 more
The paper proposes a comprehensive monitoring and triage methodology for agentic systems, demonstrating that structural defects mask task-level errors and require specialized monitoring scopes for det…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang +1 more
The paper introduces AIBuildAI-2, a knowledge-enhanced agent that significantly improves the automatic building of AI models by integrating an external, evolving knowledge system, achieving state-of-t…
The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…
This paper studies AI development frameworks for software engineering and proposes a six-dimension process taxonomy.
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…
The paper introduces POIROT, a novel protocol that uses a multi-agent system's own agents to detect and diagnose failures, demonstrating superior performance over traditional evaluation me…