~ similar to 2606.02423· 20 results
The paper introduces HARP, a new methodology to measure how localized harm (like compromising one agent) can be amplified into significant, system-wide harm within complex multi-agent LLM workflows.
This paper introduces AgentREVEAL, a diagnostic framework showing that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety alignment a…
This paper introduces AgentREVEAL, a diagnostic framework that demonstrates that the utility of web retrieval in LLM agents creates a safety-utility trade-off, as relevance itself can degrade safety a…
Yukun Jiang, Yage Zhang, Michael Backes, Xinyue Shen +1 more
This paper presents HarmfulSkillBench, a large-scale benchmark demonstrating that even small percentages of publicly available skills can be misused for harmful actions, significantly lowering LLM ref…
The paper introduces Owner-Harm, a formal threat model addressing the critical blind spot of AI agents harming their own deployers, demonstrating that specialized defenses are needed beyond generic sa…
Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more
The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…
Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more
The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more
The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…
Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li +6 more
The paper introduces Jargon, a novel adversarial framework that exploits the vulnerability of LLMs to context-specific safety boundary blurring, achieving high attack success rates across multiple fro…
Weidi Luo, Xiaofei Wen, Tenghao Huang, Hongyi Wang +4 more
The paper introduces FoodGuardBench, a comprehensive benchmark and a specialized guardrail model (FoodGuard-4B) to rigorously test and mitigate the severe food safety risks posed by large language mod…
Xi Yang, Chang Liu, Zhenglin Huang, Haoran Li +3 more
This paper introduces Ghostwriter, an attack framework demonstrating that LLMs are highly vulnerable to adopting misleading viewpoints when provided with fabricated, yet credible-looking, evidence.
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.
Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary +11 more
The paper conducts a preliminary safety evaluation of the open-weight LLM Kimi K2.5, finding that while it is highly capable, it exhibits concerning dual-use risks, particularly regarding CBRNE misuse…
Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang +2 more
DataShield proposes an efficient method to identify safety-degrading samples within benign datasets, preventing the degradation of LLM safety capabilities during fine-tuning.
Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang +2 more
DataShield proposes an efficient method to identify safety-degrading samples within benign datasets, quantifying each sample's contribution to an LLM's compliance behavior.
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more
The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…
Xiaoyu Xu, Minxin Du, Qipeng Xie, Haobin Ke +2 more
The paper identifies 'unintended long-term state poisoning'—a security risk where routine user interactions gradually corrupt an LLM agent's persistent state—and proposes a defense mechanism called St…