~ similar to 2606.02111· 18 results
This paper introduces ComicJailbreak, a new benchmark demonstrating that structured visual narratives can effectively jailbreak Multimodal Large Language Models (MLLMs), requiring new safety alignment…
Wenzhuo Xu, Zhipeng Wei, Zonghao Ying, Deyue Zhang +3 more
The paper proposes DMN, a compositional jailbreak framework that utilizes distributed instructions, multimodal evidence, and a number chain task across multiple images to significantly enhance the att…
This survey provides a comprehensive taxonomy and vulnerability-centric analysis of adversarial attacks targeting Multimodal Large Language Models (MLLMs), offering an explanatory framework for enhanc…
Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more
The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more
This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more
This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more
The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.
The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…
Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang +3 more
The paper proposes EVA, a novel framework that uses direct model editing to surgically correct specific neurons responsible for jailbreaking vulnerabilities in LLMs and VLMs, achieving robust safety a…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin +4 more
The paper analyzes that while multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, partic…
Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai +2 more
The paper introduces RoboJailBench, the first standardized evaluation framework for assessing adversarial jailbreak attacks and defenses in embodied AI systems like robots.