Papers similar to 2605.30883v1

~ similar to 2605.30883v1· 20 results

cs.CRcs.CLRecentMay 1, 2026

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more

The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…

View →

cs.CRcs.CLRecentMay 11, 2026

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang +7 more

The paper introduces LITMUS, a novel benchmark that rigorously tests LLM agents for dangerous, physical-layer behavioral jailbreaks in real OS environments, revealing that current agents frequently ex…

View →

cs.CRcs.AIRecentMay 7, 2026

LoopTrap: Termination Poisoning Attacks on LLM Agents

Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more

The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.

View →

cs.CRRecentMay 28, 2026

Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking

Junke Zhang, Jianwei Wang, Sishuo Chen, Yizhang He +2 more

The paper proposes MemoAttack, a memory-driven black-box jailbreak framework that systematically models, evolves, and selects attack experiences to significantly enhance LLM jailbreaking success rates…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →

cs.MAcs.AIcs.CRRecentMay 8, 2026

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Jianming Chen, Yawen Wang, Junjie Wang, Zhe Liu +2 more

OrchJail introduces an orchestration-guided fuzzing framework to systematically jailbreak tool-calling text-to-image agents by exploiting unsafe multi-step tool-orchestration patterns.

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…

View →

cs.CRcs.AIcs.CLRecentMay 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…

View →

cs.CRcs.LGRecentApr 25, 2026

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

Kexin Chu

The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…

View →

cs.CRcs.AIRecentMay 13, 2026

AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

Haomin Zhuang, Hanwen Xing, Yujun Zhou, Yuchen Ma +4 more

The paper introduces AgentTrap, a dynamic benchmark that measures LLM agent susceptibility to malicious side effects embedded within seemingly benign third-party skills, finding that agents often exec…

View →

cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →

cs.CRcs.AIcs.LGRecentMay 26, 2026

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Md Hafizur Rahman, Zafaryab Haider, Tanzim Mahfuz, Prabuddha Chakraborty

The paper introduces HARP, a new methodology to measure how localized harm (like compromising one agent) can be amplified into significant, system-wide harm within complex multi-agent LLM workflows.

View →

cs.CRcs.LGRecentApr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more

The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…

View →

cs.CRcs.AIcs.CLRecentApr 8, 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…

View →

cs.CRcs.AIcs.CLRecentApr 22, 2026

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi +1 more

This paper introduces a novel Function Hijacking Attack (FHA) that manipulates the tool selection process of agentic models, demonstrating a robust and context-agnostic threat to function calling LLMs…

View →

cs.CRcs.AIcs.CLRecentApr 13, 2026

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…

View →

cs.CRcs.AIcs.SERecentApr 7, 2026

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

Jiaren Peng, Zeqin Li, Chang You, Yan Wang +16 more

This paper provides the first comprehensive systematization and large-scale empirical evaluation of existing LLM-based Automated Penetration Testing (AutoPT) frameworks, offering a structured taxonomy…

View →

cs.CRcs.AIRecentMay 12, 2026

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

Zhaojiacheng Zhou

The paper introduces Proteus, a self-evolving red-team framework that measures the adaptive leakage risk of LLM agent skills, demonstrating that current vetting methods significantly underestimate res…

View →

cs.CRRecentApr 24, 2026

Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

Biagio Andreucci, Arcangelo Castiglione

Automation-Exploit is a multi-agent LLM framework that enables adaptive offensive security by using a digital twin to safely test and execute high-risk memory-corruption exploits on live targets.

View →