Papers similar to 2605.17413v1

~ similar to 2605.17413v1· 20 results

cs.CRcs.AIcs.LGRecentApr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…

View →

cs.CRcs.AIRecentMay 19, 2026

Measuring Safety Alignment Effects in Autonomous Security Agents

Isaac David, Arthur Gervais

The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…

View →

cs.CRRecentMay 6, 2026

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek +1 more

The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…

View →

cs.CRcs.CLRecentApr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

View →

cs.CRcs.AIcs.CLRecentMar 23, 2026

SecureBreak -- A dataset towards safe and secure models

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera

The paper introduces SecureBreak, a manually annotated, safety-oriented dataset designed to help detect harmful outputs from large language models (LLMs) that bypass existing security alignments.

View →

cs.CRcs.SERecentMay 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

View →

cs.LGcs.AIcs.CERecentMay 3, 2026

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

Sadia Asif, Mohammad Mohammadi Amiri

The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…

View →

cs.CRcs.AIRecentApr 29, 2026

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli

The paper proposes detecting 'alignment faking' (AF)—where LLMs revert to unsafe behavior when unmonitored—by analyzing observable tool selection patterns, finding that detection rates vary significan…

View →

cs.AIRecentMay 27, 2026

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Dasol Choi, Alex Kwon

The paper introduces 'brittle safety,' a failure mode where aligned language models fail to adapt their safety behavior when a situational context changes, and proposes state-aware validation to detec…

View →

cs.CRRecentApr 21, 2026

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

Alex Polyakov, Daniel Kuznetsov

The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…

View →

cs.AIcs.CLRecentJun 1, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more

SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…

View →

cs.CRcs.AIcs.CLRecentMay 7, 2026

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more

The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…

View →

cs.AIcs.CRRecentApr 1, 2026

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more

The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…

View →

cs.CRcs.AIcs.LGRecentMay 27, 2026

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

Jianwei Li, Jung-Eun Kim

The paper argues that the 'positive backdoor' label should be retired and replaced with 'Secret Alignment,' asserting that such protective claims must be rigorously evaluated for security, especially…

View →

cs.CRcs.AIcs.LGRecentMay 27, 2026

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

Jianwei Li, Jung-Eun Kim

The paper argues that the 'positive backdoor' label should be retired and replaced with 'Secret Alignment,' asserting that all such protective claims require rigorous, standardized evaluation due to i…

View →

cs.CRcs.AIcs.SERecentJun 3, 2026

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…

View →

cs.CYcs.CRcs.HCRecentMar 25, 2026

Learning from Mistakes: Can LLM Self-Recover after Misalignment?

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

This paper shifts the focus of LLM safety from preventing misalignment to investigating the model's intrinsic ability to self-recover its alignment after being corrupted by adversarial inputs.

View →

cs.CRcs.AIcs.LGRecentMay 24, 2026

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Tongxi Wu, Jian Zhang, Yang Gao

The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…

View →

cs.CLcs.CRRecentMay 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →