Papers similar to 2605.10998v1

~ similar to 2605.10998v1· 20 results

cs.LGcs.AIcs.CRRecentMay 4, 2026

Self-Mined Hardness for Safety Fine-Tuning

The paper proposes a novel safety fine-tuning method that uses the target model's own rollouts to identify and train on the hardest prompts, significantly reducing jailbreak success rates while mainta…

View →

cs.CRcs.AIRecentMay 8, 2026

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more

The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.

View →

cs.LGcs.CLcs.CRRecentApr 29, 2026

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu +3 more

The paper investigates how dynamic adversarial fine-tuning (R2D2) reorganizes the internal mechanisms (refusal geometry) of safety-aligned language models, finding that it shifts the optimal refusal c…

View →

cs.CRRecentMay 23, 2026

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more

The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…

View →

cs.CRcs.SDRecentApr 17, 2026

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Jaechul Roh, Amir Houmansadr

This paper demonstrates that benign fine-tuning significantly degrades safety in Audio LLMs, showing that the vulnerability is distinct from text and vision modalities and is highly dependent on the m…

View →

cs.CRcs.AIcs.LGRecentMay 14, 2026

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…

View →

cs.CRcs.AIcs.CLRecentApr 20, 2026

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Md Rysul Kabir, Zoran Tiganj

The paper investigates how different methods of jailbreaking large language models (SFT, RLVR, and abliteration) lead to vastly different behavioral and mechanistic failures, even when all methods ach…

View →

cs.LGcs.CRRecentMay 26, 2026

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Kevin Kuo, Chhavi Yadav, Virginia Smith

This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…

View →

cs.CRcs.AIcs.LGRecentJun 2, 2026

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann +1 more

The paper introduces Indirect Harm Optimization (IHO), a novel black-box, adaptive, and efficient attack method that significantly improves jailbreak success rates against LLMs, aiming to provide a st…

View →

cs.CRcs.LGRecentApr 6, 2026

Fine-Tuning Integrity for Modern Neural Networks: Structured Drift Proofs via Norm, Rank, and Sparsity Certificates

Zhenhang Shang, Kani Chen

The paper introduces Fine-Tuning Integrity (FTI), a security goal that uses Succinct Model Difference Proofs (SMDPs) to cryptographically prove that a fine-tuned model update adheres to specific struc…

View →

cs.CRcs.CLRecentApr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

View →

cs.CRcs.AIcs.LGRecentApr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran

This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.CRcs.AIcs.LGRecentMay 11, 2026

Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz

The paper introduces Acceptance Cards, a rigorous four-diagnostic standard, to provide a comprehensive and reliable evaluation protocol for claims of safe fine-tuning defenses.

View →

cs.CRcs.AIcs.CLRecentMar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…

View →

cs.CRcs.AIRecentMay 9, 2026

Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

Carsten Maple, Abhishek Kumar, Riya Tapwal

This paper argues that reporting only the best-case attack success rate for jailbreaks is insufficient, proposing new distributional metrics (VSM and UC) to better characterize the true threat posed b…

View →

cs.CRcs.CVRecentMay 19, 2026

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Zeyao Liu, Zhendong Zhao, Xiaojun Chen, Xin Zhao +2 more

The paper introduces VIPER, a novel backdoor attack framework that exploits the functional fusion of malicious and benign logic within dynamic prompt architectures, demonstrating a new, high-risk thre…

View →

cs.LGcs.CRRecentJun 3, 2026

Sequential Data Poisoning in LLM Post-Training

Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath +1 more

The paper introduces the threat model of sequential data poisoning, demonstrating that multiple, collaborating attackers can exploit compound vulnerabilities in LLM post-training pipelines that are in…

View →

cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →

cs.CRcs.AIcs.CLRecentApr 27, 2026

Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang, Katia Sycara, Yaqi Xie

The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…

View →