ArXivCSExplorer
Built by Teycir Ben Soltane

20 results for “Familiarity with simulation tools”

CS papers only

Hybrid search: keyword + semantic, ranked by combined score.


cs.AI • Recent • May 27, 2026

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

Sara Metcalf, William Schoenberg

The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…

cs.AI • Recent • May 27, 2026

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai

The paper proposes FeasiGen, a method to automatically create infeasible tasks for tool-using agents, and finds that most current agents struggle significantly to detect and stop when faced with such…

cs.LG, cs.AI • Recent • May 28, 2026

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more

This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…

cs.AI • Recent • May 28, 2026

Temporal Stability and Few-Shot Prompting in Math Task Assessment

Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

This study investigated the stability and prompt-responsiveness of AI tools in classifying the cognitive demand of math tasks, finding that few-shot prompting was a more reliable performance booster t…

cs.CV, cs.AI • Recent • Jun 1, 2026

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang +5 more

The paper argues that observed gains in multimodal agents using tools may be due to learning tool-calling patterns rather than genuine capability expansion, finding that tool access provides little co…

cs.CL • Recent • Jun 1, 2026

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…

cs.LG, cs.CL, cs.CR • Recent • May 30, 2026

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

Mohammed Sameer Syed, Rozhin Yasaei

The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's vulnerability to adversarial content changes based on whether the malicious input arrives via the user message, tool meta…

cs.AI • Recent • May 30, 2026

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Yufeng Wang

This paper investigates the 'faithfulness gap' in LLM agents—the discrepancy between stated reasoning and actual action—by decomposing it into two opposing steps: reasoning-to-conclusion and conclusio…

cs.AI • Recent • May 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

cs.CR, cs.CL • Recent • May 17, 2026

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li +4 more

The paper introduces a new security benchmark and framework to defend LLM agents against 'cognitive poisoning,' where malicious tools build trust through benign feedback before executing a harmful fin…

cs.CR, cs.SE • Recent • Apr 1, 2026

Automated Generation of Cybersecurity Exercise Scenarios

Charilaos Skandylas, Mikael Asplund

The paper presents an approach to automatically generate a large number of diverse and complex cybersecurity scenarios that model enterprise IT systems for training purposes.

cs.CR, cs.AI, cs.MA • Recent • May 1, 2026

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

The paper proposes a trust schema and verification framework to ensure that agent skills, which augment LLMs, are rigorously verified before deployment, thereby making human-in-the-loop oversight scal…

cs.CR, cs.AI, cs.CL • Recent • May 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

cs.AI, cs.LG, cs.SE • Recent • May 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

cs.AI • Recent • May 29, 2026

MAVEN: Improving Generalization in Agentic Tool Calling

Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

The paper introduces MAVEN, a lightweight symbolic reasoning scaffold that significantly improves the generalization and end-to-end success rate of large language models in complex, multi-step tool-ca…

cs.AI • Recent • Jun 1, 2026

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen +3 more

The paper proposes EAPO, a framework that enables agentic models to learn when to forgo using external tools, thereby mitigating tool abuse while maintaining high reasoning accuracy.

cs.AI • Recent • May 27, 2026

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie +6 more

The paper introduces AsyncTool, a new benchmark designed to evaluate LLM agents' ability to handle multiple, concurrent tasks with delayed tool feedback, demonstrating that asynchronous coordination i…

cs.SE, cs.CR • Recent • May 10, 2026

Evaluating Tool Cloning in Agentic-AI Ecosystems

Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more

The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…

cs.CV, cs.AI, cs.RO • Recent • May 28, 2026

Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes

Chenxi Tao, Seung-Kyum Choi

The paper reframes industrial visual sim-to-real transfer as a domain-gap problem categorized by the availability of explicit object geometry (CAD), arguing that the required prior evidence dictates t…
