~ similar to 2605.19722v1· 20 results
The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its pr…
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more
The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…
AgentTrust is a novel runtime safety layer that intercepts and evaluates AI agent tool calls before execution, achieving high accuracy in detecting unsafe actions across complex and obfuscated scenari…
The paper introduces Owner-Harm, a formal threat model addressing the critical blind spot of AI agents harming their own deployers, demonstrating that specialized defenses are needed beyond generic sa…
Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more
The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…
The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's vulnerability to adversarial content changes based on whether the malicious input arrives via the user message, tool meta…
The paper introduces the Safety Asymmetry Score (SAS) to measure how a model's susceptibility to adversarial attacks changes based on whether the malicious content arrives via the user message, tool m…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen +3 more
The paper introduces AgentFlow, a novel framework that uses a typed graph DSL and feedback-driven optimization to automatically synthesize and improve multi-agent harnesses for discovering security vu…
Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger +1 more
The paper introduces SLEIGHT-Bench, a benchmark of 40 synthetic attacks, demonstrating that current LLM monitor systems fail to detect a significant number of covert, harmful actions executed by codin…