Papers similar to 2605.12565v1

~ similar to 2605.12565v1· 20 results

cs.LGcs.CRRecentMay 12, 2026

Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

The paper introduces Persona-Conditioned Adversarial Prompting (PCAP), a method that significantly improves LLM red-teaming by simulating diverse attacker personas, leading to the discovery of more co…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.AIcs.CRRecentMay 5, 2026

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…

View →

cs.CLcs.CRRecentMay 4, 2026

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…

View →

cs.CRcs.AIcs.MARecentApr 23, 2026

AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

Tanmay Gautam, Alireza Bahramali, Sandeep Atluri

AutoRISE proposes optimizing the entire attack strategy—by searching over executable programs—rather than just optimizing prompts, achieving significant improvements in red-teaming large language mode…

View →

cs.CRRecentMay 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Kemal Derya, Berk Sunar

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

View →

cs.CRRecentMay 28, 2026

Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection Attacks

Nima Dorzhiev, Peng Liu

The paper introduces dynamic, per-request separator generation for Polymorphic Prompt Assembling (PPA), significantly reducing the blast-radius vulnerability to prompt injection attacks by ensuring un…

View →

cs.CRRecentApr 5, 2026

SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement

Zenghao Duan, Yuxin Tian, Zhiyi Yin, Liang Pang +5 more

SkillAttack is a red-teaming framework that dynamically tests the exploitability of latent vulnerabilities in LLM agent skills using adversarial prompting, demonstrating that even benign skills pose s…

View →

cs.CRcs.AIcs.CLRecentApr 9, 2026

PIArena: A Platform for Prompt Injection Evaluation

Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen +1 more

The paper introduces PIArena, a unified and extensible platform designed to address the lack of standardized evaluation for prompt injection, revealing critical limitations in current state-of-the-art…

View →

cs.CRcs.LGRecentMay 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.

View →

cs.CRRecentMay 20, 2026

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Shahnewaz Karim Sakib, Swati Kar, Anindya Bijoy Das

The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…

View →

cs.CRcs.AIRecentJun 2, 2026

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

NeuroArmor is a white-box runtime defense that uses prompt-specific safe variants to selectively detect and mitigate jailbreak attacks, significantly reducing attack success rates while maintaining a…

View →

cs.LGcs.AIcs.CRRecentMar 25, 2026

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye +2 more

The paper demonstrates that using advanced AI agents in an autoresearch loop can discover novel and highly effective adversarial attack algorithms, significantly advancing the state-of-the-art for jai…

View →

cs.CRcs.AIcs.CLRecentApr 22, 2026

Adaptive Instruction Composition for Automated LLM Red-Teaming

Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa +1 more

The paper introduces Adaptive Instruction Composition, a novel framework that uses reinforcement learning to intelligently combine crowdsourced texts, significantly improving the effectiveness and div…

View →

cs.CRcs.CLRecentApr 24, 2026

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…

View →

cs.CRcs.AIRecentMar 25, 2026

Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

Yulin Shen, Xudong Pan, Geng Hong, Min Yang

The paper introduces Tree structured Injection for Payloads (TIP), a novel black-box attack framework that reliably generates stealthy injection payloads to seize control of LLM agents utilizing the M…

View →

cs.CRcs.AIcs.CLRecentApr 6, 2026

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Charafeddine Mouzouni

The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…

View →

cs.CRcs.AIcs.CLRecentJun 3, 2026

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…

View →

cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →

cs.CRcs.AIRecentMar 26, 2026

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak

The security of LLM agents is critically dependent on their system prompt configuration, which creates a brittle attack surface that can be exploited by attackers inverting the prompt's core assumptio…

View →