~ similar to 2604.20911v1· 20 results
The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…
The paper identifies and measures a critical failure mode where LLM agents violate policies by losing or corrupting directive-bearing state during the process of assembling the decision context, and p…
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala +2 more
The paper identifies the Misattribution Gap, showing that memory-layer attacks (Semantic Norm Drift) can mimic model failure in multi-agent AI systems, and proposes novel detection and mitigation tech…
The paper introduces ASPI, a benchmark showing that requiring LLM agents to seek clarification significantly amplifies their vulnerability to prompt injection attacks.
The paper introduces AgentSecBench, a security evaluation framework that measures prompt injection, privacy leakage, and tool-use integrity in LLM agents by defining formal security games and testing…
The paper introduces the concept of policy-invisible violations in LLM agents and proposes Sentinel, a counterfactual graph simulation framework, which significantly improves policy enforcement accura…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
Baoyuan Wu, Qingshan Liu, Adel Bibi, Irwin King +1 more
The paper argues that the Authorization-Execution Gap (AEG)—the divergence between intended authorization and actual execution—is a critical safety and security flaw in open-world agents, requiring so…
The paper introduces a Contextual Integrity (CI) framework and a new benchmark (DelegateCI-Bench) to rewrite user queries sent to cloud LLMs, ensuring only task-essential information is retained while…
The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…
The paper demonstrates that many instruction-tuned language models suffer from 'silent commitment failure,' meaning they can produce confidently incorrect outputs without any warning signal, and intro…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…
The paper analyzes the failure modes of current AI containment methods when the agent itself is the adversary, deriving five necessary architectural requirements for durable safety.
The paper introduces WIRE, a pipeline for diagnosing live intra-policy rule conflicts in LLM agents by identifying and testing specific rule pairs within a single prompt policy that can co-govern a re…
The paper proposes an attestation-aware promotion gate to mitigate supply-chain risks in LLM pipelines by cryptographically verifying and enforcing claims about training and release artifacts before d…
Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie +2 more
The paper introduces ARGUS, a defense mechanism that uses provenance-aware decision auditing to protect LLM agents from sophisticated, context-aware prompt injection attacks, significantly reducing th…
Wenjie Qu, Ming Xu, Peiran Wang, Shengfang Zhai +2 more
The paper proposes defining 'intent-to-execution integrity' as the necessary end-to-end correctness property for securing LLM agents, arguing that current defenses are insufficient due to untrusted co…