Papers similar to 2603.24543v1

~ similar to 2603.24543v1· 20 results

cs.CRcs.LGRecentMay 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.

View →

cs.CRcs.CLRecentApr 14, 2026

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more

The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…

View →

cs.CRcs.LGRecentApr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more

The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…

View →

cs.CRcs.AIRecentApr 11, 2026

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Vishal Pramanik, Maisha Maliha, Susmit Jha, Sumit Kumar Jha

The paper introduces Head-Masked Nullspace Steering (HMNS), a novel geometry-aware attack method that achieves state-of-the-art jailbreak success rates by manipulating the internal attention mechanism…

View →

cs.CRRecentApr 18, 2026

HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking

Zeng Wang, Minghao Shao, Weimin Fu, Prithwish Basu Roy +5 more

The paper introduces HarmChip, a novel benchmark to evaluate LLM vulnerability to domain-specific hardware security threats, revealing that current safety guardrails fail against semantically disguise…

View →

cs.CRcs.LGRecentMay 19, 2026

Adaptive Probe-based Steering for Robust LLM Jailbreaking

Junxi Chen, Junhao Dong, Xiaohua Xie

The paper introduces an adaptive probe-based steering method that significantly improves the robustness and effectiveness of LLM jailbreaking without requiring extra prompts or manual tuning.

View →

cs.CRcs.AIRecentApr 18, 2026

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan, Weikai Lin, Yada Zhu, Song Wang

SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.

View →

cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…

View →

cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

View →

cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →

cs.CRcs.AIRecentMay 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

View →

cs.CRcs.AIcs.CLRecentApr 20, 2026

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Md Rysul Kabir, Zoran Tiganj

The paper investigates how different methods of jailbreaking large language models (SFT, RLVR, and abliteration) lead to vastly different behavioral and mechanistic failures, even when all methods ach…

View →

cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →

cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →

cs.CRcs.CLRecentMay 11, 2026

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang +7 more

The paper introduces LITMUS, a novel benchmark that rigorously tests LLM agents for dangerous, physical-layer behavioral jailbreaks in real OS environments, revealing that current agents frequently ex…

View →

cs.CRcs.AIRecentMay 13, 2026

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Zvi Topol

The paper introduces a novel survival analysis framework to quantify how LLM safety degrades over repeated adversarial attacks, revealing distinct vulnerability profiles among tested models.

View →

cs.CRcs.AIcs.LGRecentApr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran

This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…

View →

cs.AIRecentMay 28, 2026

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

The paper introduces Temporal Logit Observability (TLO), a training-free diagnostic that analyzes the decoding process to reveal the temporal patterns of LLM safety failures, showing that failure mech…

View →

cs.CRcs.AIRecentMar 28, 2026

GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

Md Jueal Mia, Joaquin Molto, Yanzhao Wu, M. Hadi Amini

The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…

View →

cs.CRcs.AIRecentApr 10, 2026

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou +2 more

This paper introduces a novel backdoor attack (ACB) against Reinforcement Learning with Verifiable Rewards (RLVR), demonstrating that poisoning the training data can implant a backdoor that significan…

View →