Papers similar to 2605.05995v2

~ similar to 2605.05995v2 · 20 results

cs.LG, cs.AI, cs.CE · Recent · May 3, 2026

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…

cs.AI · Recent · May 28, 2026

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang +2 more

The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining u…

cs.CR · Recent · May 6, 2026

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek +1 more

The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…

cs.CR, cs.AI, cs.LG · Recent · Apr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran

This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…

cs.CR, cs.CL · Recent · Apr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

cs.LG, cs.CR · Recent · Jun 2, 2026

When Autoregressive Consistency Hurts Safety Alignment

Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu

The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessit…

cs.LG, cs.AI, cs.CR · Recent · May 27, 2026

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo +4 more

SPARD is a defense framework that uses Safety-Projected Alternating optimization and Relevance-Diversity data selection to protect large language models from harmful fine-tuning attacks, achieving sup…

cs.LG, cs.AI, cs.CR · Recent · May 27, 2026

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo +4 more

SPARD is a defense framework that uses Safety-Projected Alternating optimization and Relevance-Diversity data selection to mitigate harmful fine-tuning attacks that undermine LLM safety.

cs.AI, cs.CL · Recent · Jun 1, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more

SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…

cs.LG, cs.CL · Recent · May 28, 2026

CSULoRA: Closest Safe Update Low-Rank Adaptation

Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah

CSULoRA is a post-hoc method that corrects trained LoRA adapters by estimating a safety-aligned subspace and solving a penalized minimum-change problem to attenuate unsafe update directions while pres…

cs.LG, cs.AI, cs.CL · Recent · May 30, 2026

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan +4 more

MESA is a targeted alignment framework that decentralizes safety responsibilities across multiple experts in Mixture-of-Experts (MoE) LLMs using Optimal Transport theory, thereby improving safety robu…

cs.LG, cs.CR · Recent · May 26, 2026

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

Kevin Kuo, Chhavi Yadav, Virginia Smith

This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…

cs.CR, cs.AI · Recent · Apr 7, 2026

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler +5 more

The paper proves that no continuous, utility-preserving wrapper defense can make all inputs strictly safe for a language model with a connected prompt space, establishing a 'defense trilemma' among co…

cs.CR · Recent · Apr 17, 2026

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li +6 more

The paper introduces Jargon, a novel adversarial framework that exploits the vulnerability of LLMs to context-specific safety boundary blurring, achieving high attack success rates across multiple fro…

cs.AI, cs.CR · Recent · May 18, 2026

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more

This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…

cs.LG, cs.CL, cs.CR · Recent · Apr 29, 2026

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu +3 more

The paper investigates how dynamic adversarial fine-tuning (R2D2) reorganizes the internal mechanisms (refusal geometry) of safety-aligned language models, finding that it shifts the optimal refusal c…

cs.LG, cs.AI, cs.CR · Recent · Apr 8, 2026

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam +3 more

The paper demonstrates that fine-tuning safety guard models on benign data can catastrophically collapse their safety alignment, proposing Fisher-Weighted Safety Subspace Regularization (FW-SSR) to ac…

cs.CY, cs.CR, cs.HC · Recent · Mar 25, 2026

Learning from Mistakes: Can LLM Self-Recover after Misalignment?

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.

cs.CR, cs.CL · Recent · Jun 4, 2026

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim +4 more

Membrane introduces a self-evolving guardrail using Contrastive Safety Memory (CSM) that generalizes across topical jailbreak variants, achieving superior safety performance while minimizing benign re…

cs.CR · Recent · Apr 9, 2026

Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

Weiwei Qi, Zefeng Wu, Tianhang Zheng, Zikang Zhang +3 more

The paper proposes the Expected Safety Impact (ESI) framework to identify safety-critical parameters in LLMs, introducing targeted tuning methods (SET and SPA) to enhance safety and preserve alignment…
