~ similar to 2604.19461v2· 20 results
The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…
The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…
This study compares two methods of safety unalignment (Jailbreak-Tuning and Weight Orthogonalization) across six LLMs and finds that Weight Orthogonalization (WO) significantly enhances malicious capa…
Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more
The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
Ki Sen Hung, Xi Yang, Chang Liu, Haoran Li +6 more
The paper introduces Jargon, a novel adversarial framework that exploits the vulnerability of LLMs to context-specific safety boundary blurring, achieving high attack success rates across multiple fro…
The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…
The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessit…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…
Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more
SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…
The paper introduces DeepTrap, an automated framework that evaluates security vulnerabilities in agentic language models by manipulating their internal execution contexts, demonstrating that task comp…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…