Papers similar to 2605.18168v1

~ similar to 2605.18168v1· 18 results

cs.SDcs.AIcs.CLRecentMay 28, 2026

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang +1 more

This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…

View →

cs.CRcs.AIcs.SDRecentApr 16, 2026

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Meng Chen, Kun Wang, Li Lu, Jiaheng Zhang +1 more

The paper introduces AudioHijack, a framework that successfully demonstrates context-agnostic and imperceptible auditory prompt injection attacks, showing that commercial Large Audio-Language Models c…

View →

cs.CRcs.AIcs.CLRecentMay 6, 2026

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang +1 more

The paper introduces Token-Aware Gradient Optimization (TAGO), demonstrating that sparse optimization focusing only on high-gradient audio tokens is sufficient for effective jailbreaking of audio lang…

View →

cs.CRRecentJun 1, 2026

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more

The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…

View →

cs.CRcs.SDRecentApr 17, 2026

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Jaechul Roh, Amir Houmansadr

This paper demonstrates that benign fine-tuning significantly degrades safety in Audio LLMs, showing that the vulnerability is distinct from text and vision modalities and is highly dependent on the m…

View →

cs.CRcs.AIRecentMay 29, 2026

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

Junyoung Park, Seongyong Ju, Sunghwan Park, Jaewoo Lee

The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates that accumulating instructions in the model's context window can override internal safety alignments, a…

View →

cs.CRcs.AIRecentMay 29, 2026

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

Junyoung Park, Seongyong Ju, Sunghwan Park, Jaewoo Lee

The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates how accumulating instructions in the model's context window can override internal safety alignments, ac…

View →

cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →

cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →

cs.CRcs.AIRecentMay 9, 2026

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Yu Chen, Yuanhao Liu, Qi Cao

The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…

View →

cs.CRcs.AIRecentMar 28, 2026

GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

Md Jueal Mia, Joaquin Molto, Yanzhao Wu, M. Hadi Amini

The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…

View →

cs.CLcs.CRRecentApr 1, 2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…

View →

cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →

cs.CRRecentMay 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Kemal Derya, Berk Sunar

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

View →

cs.CRcs.LGRecentApr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more

The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…

View →

cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…

View →

cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

View →

cs.CRcs.AIRecentMay 18, 2026

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

Wenzhuo Xu, Zhipeng Wei, Zonghao Ying, Deyue Zhang +3 more

The paper proposes DMN, a compositional jailbreak framework that utilizes distributed instructions, multimodal evidence, and a number chain task across multiple images to significantly enhance the att…

View →