Papers similar to 2606.01738

19 results

cs.CR · cs.AI · Recent · May 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…


cs.CR · cs.CL · Recent · May 1, 2026

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more

The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…


cs.CR · cs.AI · Recent · Apr 23, 2026

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Naheed Rayhan, Sohely Jahan

The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…


cs.CL · cs.AI · cs.CR · Recent · May 7, 2026

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more

The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…


cs.CR · cs.AI · Recent · May 30, 2026

MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety

Anushka Sheoran, Yiduo Hao

This paper introduces MultiTurnPSB, a multi-turn adversarial benchmark, demonstrating that the safety of medical AI chatbots degrades significantly under sustained, real-world adversarial prompting, r…


cs.CR · cs.LG · Recent · May 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.


cs.CL · cs.CR · Recent · Apr 1, 2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…


cs.SD · cs.AI · cs.CL · Recent · May 28, 2026

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang +1 more

This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…


cs.AI · Recent · May 28, 2026

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

The paper introduces Temporal Logit Observability (TLO), a training-free diagnostic that analyzes the decoding process to reveal the temporal patterns of LLM safety failures, showing that failure mech…


cs.CR · Recent · May 20, 2026

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Shahnewaz Karim Sakib, Swati Kar, Anindya Bijoy Das

The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…


cs.CR · cs.AI · Recent · Apr 18, 2026

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan, Weikai Lin, Yada Zhu, Song Wang

SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.


cs.CR · cs.AI · cs.CL · Recent · Apr 13, 2026

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…


cs.CR · cs.AI · Recent · May 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…


cs.CR · cs.AI · Recent · May 18, 2026

Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks

Doohee You

The paper proposes the Triple-tier Anomaly Defense (TRIAD) framework, a predictive model that treats safety verification as a dynamic trajectory problem to detect cumulative, cross-modal poisoning in…


cs.CR · cs.AI · Recent · May 11, 2026

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Haichang Gao

The paper introduces Disrupt-and-Rectify Smoothing (DR-Smoothing), a novel two-stage defense mechanism that significantly improves LLM security against jailbreaking attacks by restoring disrupted inpu…


cs.CR · cs.AI · cs.LG · Recent · May 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…


cs.CR · cs.AI · cs.CL · Recent · Apr 27, 2026

Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang, Katia Sycara, Yaqi Xie

The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…


cs.CL · cs.CR · Recent · May 4, 2026

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…


cs.CR · cs.CL · cs.LG · Recent · Apr 30, 2026

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more

TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
