~ similar to 2605.17413v1· 20 results
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…
The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
The paper introduces SecureBreak, a manually annotated, safety-oriented dataset designed to help detect harmful outputs from large language models (LLMs) that bypass existing security alignments.
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…
The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…
The paper introduces 'brittle safety,' a failure mode where aligned language models fail to adapt their safety behavior when a situational context changes, and proposes state-aware validation to detec…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more
SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more
The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…
Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more
The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…
The paper argues that the 'positive backdoor' label should be retired and replaced with 'Secret Alignment,' asserting that such protective claims must be rigorously evaluated for security, especially…
The paper argues that the 'positive backdoor' label should be retired and replaced with 'Secret Alignment,' asserting that all such protective claims require rigorous, standardized evaluation due to i…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.
The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…
Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more
The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…