~ similar to 2605.28553· 20 results
The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…
The paper introduces BioRefusalAudit, a method that audits the structural soundness of language model biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under min…
The paper audits the structural soundness of LLM biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under minor prompt changes, and may track legal salience rathe…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…
Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao +1 more
The paper proposes SALO, a novel detector that monitors the dynamic, layer-wise activation pattern (Refusal Trajectory) to improve jailbreak detection robustness compared to traditional methods relyin…
The paper introduces Probe-Geometry Alignment (PGA), a surgical method that removes the measurable cross-sequence memorization signature from large language models without degrading their general capa…
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
The paper shows that safety failures in low-resource languages are due to a failure in the model's safety decision calibration, not a lack of underlying knowledge, and proposes a recalibration method…
This paper systematically reviews thirteen diverse malicious-code prompt corpora used to evaluate LLM refusal, identifying critical methodological gaps in current research.
This paper systematically diagnoses the failure modes of linear deception probes in LLMs, finding that while single-direction probes are insufficient, multi-dimensional probes can recover robust detec…
Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu +3 more
The paper investigates how dynamic adversarial fine-tuning (R2D2) reorganizes the internal mechanisms (refusal geometry) of safety-aligned language models, finding that it shifts the optimal refusal c…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more
The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…
This paper proposes a multi-layered defense strategy combining pre-output monitoring, calibrated canary detection, and cumulative information-flow tracking to prevent LLM agents from exfiltrating sens…
The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…
Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang +3 more
This study provides the first systematic measurement of prompt injection attacks in a real-world LLM-based resume screening application, finding that approximately 1% of resumes contain hidden injecti…
Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang +3 more
This study provides the first large-scale measurement of prompt injection attacks in real-world LLM-based resume screening, finding that approximately 1% of resumes contain hidden injections.
The paper challenges the assumption that LLM safety is a binary threshold, proposing that safety failures occur in an 'instability region' and introducing Furina, a transferable attack that exploits t…