similar to 2605.19159v1 · 20 results
Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé +1 more
The paper introduces Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play, inference-time defense that suppresses backdoor attacks on LLMs by structurally smoothing the attention mechanism…
The paper introduces a robust, two-part framework (HyPE and HyPS) using hyperbolic geometry to efficiently detect and sanitize malicious prompts targeting Vision-Language Models (VLMs).
The paper introduces PromptFuzz-SC, a novel semantic-character dual-space mutation framework, demonstrating that combining both semantic and character-level attacks significantly improves the robustne…
The paper introduces a dual-dimension evaluation for universal adversarial attacks on Vision-Language Models (VLMs), demonstrating that high reported attack success rates significantly overestimate th…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
LocalAlign proposes a generalizable prompt injection defense by generating near-target adversarial examples, which enforces a tighter robustness boundary around the correct model response.
The paper evaluates prompt injection detection in a deployment-aware, multi-regime framework, finding that detection performance is highly dependent on the operational setting and that no single detec…
The paper introduces Rotated Robustness (RoR), a training-free defense that uses orthogonal transformations to prevent catastrophic model collapse in LLMs caused by hardware bit-flip attacks.
The paper introduces ImageProtector, a user-side method that embeds an imperceptible perturbation into images to prevent Multi-modal Large Language Models (MLLMs) from analyzing and extracting sensiti…
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more
The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more
This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…
The paper proposes a unified, architecture-agnostic framework that significantly improves the robustness of deepfake image detectors against adversarial attacks by focusing on higher-order frequency s…
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85%…
The paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense that proactively identifies and neutralizes malicious components in LLM prompts, achieving over 85% reduction…
The paper demonstrates that integrating Sparse Autoencoders (SAEs) into transformer residual streams significantly enhances the robustness of Large Language Models against various jailbreak attacks by…
SALLIE introduces a lightweight, modality-agnostic runtime detection framework that effectively safeguards LLMs and VLMs against both textual and visual jailbreaks and prompt injections without performan…
ContractShield is a robust multimodal framework that uses a novel three-level fusion mechanism to accurately detect multiple types of vulnerabilities in obfuscated smart contracts, significantly outpe…
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more
The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.