The paper introduces AgentTrap, a dynamic benchmark that measures LLM agent susceptibility to malicious side effects embedded within seemingly benign third-party skills, finding that agents often execute unsafe side effects while completing the visible user task.
Third-party skills are becoming the package ecosystem for LLM agents. They package natural-language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high-value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third-party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security-impact dimensions grounded in agent-skill supply-chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack-not-triggered cases, and no-attack-evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model–framework–workspace environment in which users actually delegate work. Code and data are available at https://github.com/zhmzm/AgentTrap and https://huggingface.co/datasets/zhmzm/AgentTrap.
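The four trajectory-level outcomes named in the abstract can be pictured as a small tally over judged runs. The sketch below is illustrative only: `Outcome`, `JudgedTrajectory`, and `summarize` are hypothetical names, not part of the AgentTrap codebase, and computing the attack success rate over malicious tasks only is an assumption rather than the paper's stated metric. The records in the usage lines are made-up placeholders, not benchmark results.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """The four judgment categories listed in the abstract (names are hypothetical)."""
    ATTACK_SUCCESS = "attack_success"
    BLOCKED_OR_REFUSED = "blocked_or_refused"
    ATTACK_NOT_TRIGGERED = "attack_not_triggered"
    NO_ATTACK_EVIDENCE = "no_attack_evidence"


@dataclass(frozen=True)
class JudgedTrajectory:
    """One complete agent run after judging (hypothetical record layout)."""
    task_id: str
    malicious: bool  # True for a malicious task, False for a benign utility task
    outcome: Outcome


def summarize(trajectories):
    """Tally outcomes over malicious tasks and compute an attack success rate.

    Assumption: the rate is attack successes divided by all malicious tasks.
    """
    malicious = [t for t in trajectories if t.malicious]
    counts = Counter(t.outcome for t in malicious)
    total = len(malicious)
    rate = counts[Outcome.ATTACK_SUCCESS] / total if total else 0.0
    return {
        "malicious_tasks": total,
        "counts": {o.value: counts.get(o, 0) for o in Outcome},
        "attack_success_rate": rate,
    }


# Toy placeholder records, purely to show the call shape.
demo = [
    JudgedTrajectory("m-001", True, Outcome.ATTACK_SUCCESS),
    JudgedTrajectory("m-002", True, Outcome.BLOCKED_OR_REFUSED),
    JudgedTrajectory("m-003", True, Outcome.ATTACK_NOT_TRIGGERED),
    JudgedTrajectory("b-001", False, Outcome.NO_ATTACK_EVIDENCE),
]
summary = summarize(demo)
print(summary)
```

Keeping blocked, not-triggered, and no-evidence cases as separate buckets matters for interpretation: a run in which the malicious step was never reached says little about whether the agent would have resisted it.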
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
The paper introduces BadSkill, a novel backdoor attack formulation that targets…
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement
SkillAttack is a red-teaming framework that dynamically tests the exploitability…
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
The paper introduces Document-Driven Implicit Payload Execution (DDIPE) to demon…
Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem
This paper conducts a large-scale, repository-aware security analysis of AI agen…
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis
This paper provides the first comprehensive security analysis of the Agent Skill…
"Elementary, My Dear Watson." Detecting Malicious Skills via Neuro-Symbolic Reasoning across Heterog…
The paper introduces MalSkills, a neuro-symbolic framework that detects maliciou…
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
SkillTrojan introduces a novel backdoor attack targeting the composition of reus…
SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration
The paper proposes SkillProbe, a multi-agent security auditing framework, demons…