~ similar to 2604.04992v1· 19 results
The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
This paper introduces ComicJailbreak, a new benchmark demonstrating that structured visual narratives can effectively jailbreak Multimodal Large Language Models (MLLMs), requiring new safety alignment…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more
The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.
ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…
The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…
The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…
The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…
The paper introduces a framework using the 'behavioral geometry' of model populations to efficiently predict jailbreak susceptibility and transfer defenses, achieving high accuracy with significantly…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…
This paper investigates how on-policy Reinforcement Learning (RL) affects LLM safety, finding that safety training modulates harmful misalignment, but the direction of this effect is highly dependent…