Papers similar to 2605.02958v2

~ similar to 2605.02958v2· 20 results

cs.CRRecentMay 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

View →

cs.CRRecentMay 23, 2026

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more

The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…

View →

cs.CRcs.LGRecentMay 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.

View →

cs.CRcs.AIRecentApr 9, 2026

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Cheng Liu, Xiaolei Liu, Xingyu Li, Bangzhou Xin +1 more

TrajGuard is a novel, training-free defense framework that detects jailbreaks by monitoring the progressive risk signals embedded in the hidden-state trajectories of tokens during the LLM decoding pro…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.CRcs.AIRecentMay 9, 2026

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Yu Chen, Yuanhao Liu, Qi Cao

The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…

View →

cs.AIcs.CRRecentMay 27, 2026

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko +3 more

The paper demonstrates that refusal behavior in Large Language Models (LLMs) is encoded as an actionable, linearly decodable signal in intermediate transformer activations, allowing for early detectio…

View →

cs.AIcs.CRRecentMay 27, 2026

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko +3 more

View →

cs.CRcs.AIcs.LGRecentJun 4, 2026

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee +1 more

The paper introduces SlotGCG, an improved jailbreak attack method that systematically searches for the most vulnerable token insertion positions (slots) within a prompt, significantly boosting attack…

View →

cs.CRcs.AIcs.CLRecentApr 13, 2026

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…

View →

cs.CRcs.AIcs.LGRecentMar 24, 2026

Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang +5 more

The paper proposes TriageFuzz, a token-aware fuzzing framework that significantly reduces the number of queries needed to jailbreak LLMs while maintaining high attack success rates.

View →

cs.CRcs.AIRecentJun 2, 2026

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

NeuroArmor is a white-box runtime defense that uses prompt-specific safe variants to selectively detect and mitigate jailbreak attacks, significantly reducing attack success rates while maintaining a…

View →

cs.AIcs.CRRecentJun 4, 2026

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more

GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…

View →

cs.LGcs.CLcs.CRRecentApr 29, 2026

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu +3 more

The paper investigates how dynamic adversarial fine-tuning (R2D2) reorganizes the internal mechanisms (refusal geometry) of safety-aligned language models, finding that it shifts the optimal refusal c…

View →

cs.CRcs.AIRecentMay 11, 2026

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more

The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.

View →

cs.CRcs.AIcs.LGRecentMay 26, 2026

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Hayden Helm, Xiaodong Liu, Weiwei Yang

The paper introduces a framework using the 'behavioral geometry' of model populations to efficiently predict jailbreak susceptibility and transfer defenses, achieving high accuracy with significantly…

View →

cs.LGcs.AIcs.CLRecentApr 20, 2026

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal

The paper demonstrates that integrating Sparse Autoencoders (SAEs) into transformer residual streams significantly enhances the robustness of Large Language Models against various jailbreak attacks by…

View →

cs.AIRecentMay 28, 2026

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

The paper introduces Temporal Logit Observability (TLO), a training-free diagnostic that analyzes the decoding process to reveal the temporal patterns of LLM safety failures, showing that failure mech…

View →

cs.CRcs.AIcs.CLRecentApr 27, 2026

Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang, Katia Sycara, Yaqi Xie

The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…

View →

cs.CLcs.CRRecentApr 1, 2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…

View →