~ similar to 2605.01462v1· 20 results
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
The paper tested the hypothesis that wrapping untrusted prompt inputs in mock tool calls would improve LLM robustness, but found that this technique generally fails and can even increase vulnerability…
Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie +2 more
The paper introduces ARGUS, a defense mechanism that uses provenance-aware decision auditing to protect LLM agents from sophisticated, context-aware prompt injection attacks, significantly reducing th…
The paper proposes a multi-layered security framework to detect and mitigate SQL injection attacks that occur when Large Language Models translate natural language prompts into database queries.
Runpeng Geng, Chenlong Yin, Yanting Wang, Ying Chen +1 more
The paper introduces PIArena, a unified and extensible platform designed to address the lack of standardized evaluation for prompt injection, revealing critical limitations in current state-of-the-art…
Yulin Chen, Tri Cao, Haoran Li, Yue Liu +6 more
The paper introduces WebAgentGuard, a novel reasoning-driven, multimodal guard model that effectively detects prompt injection attacks in vulnerable web agents without compromising their functionality…
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
Tri Cao, Yulin Chen, Hieu Cao, Yibo Li +7 more
The paper proposes WARD, a robust and efficient defense model that secures web agents against prompt injection attacks embedded in web content, achieving high recall and low false positives even again…
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85%…
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85% reduction…
The paper introduces a dual-dimension evaluation for universal adversarial attacks on Vision-Language Models (VLMs), demonstrating that high reported attack success rates significantly overestimate th…
The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…
Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon +3 more
The paper evaluates prompt injection defenses and finds that only external output filtering, implemented in application code, reliably prevents secret leaks from LLMs, demonstrating that model-based d…
PlanGuard is a training-free defense framework that uses an isolated Planner and hierarchical verification to defend LLM agents against Indirect Prompt Injection by verifying the consistency of planne…
The paper introduces ESLD, an architecture that improves prompt injection defense by directly analyzing the internal latent representation of an existing guard model, achieving faster and more accurat…
ClawGuard is a novel runtime security framework that deterministically enforces user-confirmed rules at tool-call boundaries to protect LLM agents from indirect prompt injection.
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…