~ similar to 2605.23196v1· 20 results
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more
GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon +3 more
The paper evaluates prompt injection defenses and finds that only external output filtering, implemented in application code, reliably prevents secret leaks from LLMs, demonstrating that model-based d…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
Yulin Chen, Tri Cao, Haoran Li, Yue Liu +6 more
The paper introduces WebAgentGuard, a novel reasoning-driven, multimodal guard model that effectively detects prompt injection attacks in vulnerable web agents without compromising their functionality…
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…
PlanGuard is a training-free defense framework that uses an isolated Planner and hierarchical verification to defend LLM agents against Indirect Prompt Injection by verifying the consistency of planne…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…
AgentWatcher is a novel, rule-based monitor designed to detect prompt injection attacks in LLM agents by focusing detection on causally influential context segments, thereby improving scalability and…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
The paper introduces ESLD, an architecture that improves prompt injection defense by directly analyzing the internal latent representation of an existing guard model, achieving faster and more accurat…
The paper introduces Prompt Control-Flow Integrity (PCFI), a priority-aware runtime defense that models LLM prompts as structured segments to intercept prompt injection attacks with high accuracy and…