~ similar to 2605.11217v1· 20 results
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
Zhihao Liu, Yifan Wu, Jian Lou, Di Wang +2 more
The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining u…
The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…
The paper introduces SecureBreak, a manually annotated, safety-oriented dataset designed to help detect harmful outputs from large language models (LLMs) that bypass existing security alignments.
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…
Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more
SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…
Yuming Xu, Mingtao Zhang, Zhuohan Ge, Haoyang Li +6 more
This paper proposes a comprehensive taxonomy (SLOT) to systematically categorize security risks, attacks, and defenses specific to Retrieval-Augmented Generation (RAG), clarifying that these risks are…
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more
The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…
The paper proposes Sensitivity-Uncertainty Alignment (SUA), a framework that measures the misalignment between a model's prediction instability and its stated uncertainty to improve model reliability.
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
LocalAlign proposes a generalizable prompt injection defense by generating near-target adversarial examples, which enforces a tighter robustness boundary around the correct model response.
GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…
The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessit…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
SafeLM is a comprehensive framework that jointly addresses privacy, security, misinformation, and adversarial robustness in federated LLMs, achieving high safety performance while significantly reduci…
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…