~ similar to 2606.01991· 20 results
Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim +5 more
EMBGuard introduces a novel, MLLM-based safety guardrail that explicitly identifies and explains physical hazards from (visual observation, action) pairs, enabling safer planning for embodied agents.
Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou +4 more
SafeHarbor is a novel, hierarchical memory-augmented framework that establishes context-aware decision boundaries for LLM agents, achieving state-of-the-art safety while minimizing over-refusal.
The paper demonstrates a semantic denial-of-service attack against LLM-controlled robots by injecting short, safety-plausible phrases into the audio channel, causing the robot to halt or disrupt execu…
The paper introduces a novel shielding framework for Robust MDPs (RMDPs) that guarantees safety under worst-case transition probabilities, enabling safe reinforcement learning even when transition dyn…
This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.
Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming +1 more
The paper introduces MAGE, a novel defensive framework that uses a dedicated 'shadow memory' to proactively detect and mitigate long-horizon threats against LLM agents during complex, multi-step inter…
Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu +1 more
The paper proposes a novel attack paradigm demonstrating how compromising a single robot in an LLM-controlled multi-robot system can rapidly propagate malicious intent to cause coordinated unsafe acti…
Zhenhao Xu, Wenhan Chang, Yichuan Chen, Yuxin Fang +2 more
The paper proposes Safety Context Injection (SCI), an inference-time framework that prepends a structured external risk report to protect Large Reasoning Models (LRMs) against sophisticated jailbreaks…
Haoyu Wang, Zibo Xiao, Yedi Zhang, Christopher M. Poskitt +1 more
The paper proposes SafeClaw-R, a novel framework that enforces safety as a system-level invariant over the execution graph to mitigate the high safety and security risks inherent in autonomous multi-a…
Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin +3 more
This paper introduces a new benchmark to test Tool Description Poisoning (TDP) attacks on LLM agents, demonstrating that even advanced models like GPT-4o are highly vulnerable and that current defense…
This paper surveys the risks associated with world models, proposing a unified threat model and demonstrating adversarial attacks that show world models require rigorous safety standards comparable to…
AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…
The paper analyzes how runtime safety enforcement impacts the performance of multi-step LLM agents, finding that while safety mechanisms can block unsafe actions, they impose a significant performance…
The paper introduces Session Risk Memory (SRM), a lightweight module that enhances per-action authorization gates with trajectory-level risk assessment, significantly improving detection of distribute…
The paper introduces BOA, a novel framework that measures agent safety by exhaustively searching the entire in-budget trajectory space, thereby identifying unsafe behaviors missed by traditional sampl…
Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui +4 more
The paper introduces RUBAS, a rubric-based reinforcement learning framework that improves agent safety by providing fine-grained, multi-dimensional rewards for complex tool-use scenarios.
Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu +7 more
The paper introduces SafeHarness, a novel, lifecycle-integrated security architecture that significantly reduces unsafe behavior and attack success rates in LLM agents by weaving multiple defense laye…
The paper introduces a defense-placement taxonomy for the Model Context Protocol (MCP) to systematically analyze security gaps, revealing that many vulnerabilities stem from architectural misalignment…
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
Shiping Chen, Qin Wang, Guangsheng Yu, Xu Wang +1 more
This paper systematizes the security challenges of open agentic systems, concluding that while attack characterization is mature, the field lacks robust guidelines for operational governance, memory i…