~ similar to 2605.19147v1· 20 results
The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper introduces an A*-inspired framework to generate highly effective and efficient adversarial prompts that cause LLMs to hallucinate commonsense errors while maintaining the original prompt's i…
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang +2 more
The paper introduces BadStyle, a novel backdoor attack framework that generates natural, stealthy poisoned samples using LLMs to compromise various LLMs with high success rates and robust activation.
The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…
The paper introduces Obsessive Experience Poisoning (OEP), a low-privilege black-box attack that poisons self-evolving LLM agents by generating locally correct but harmful experiences, causing dangero…
RefineRAG introduces a novel word-level poisoning framework that significantly enhances knowledge poisoning attacks against RAG systems, achieving state-of-the-art effectiveness and transferability to…
SilentRetrieval introduces a sophisticated, two-stage data poisoning attack that successfully hijacks Retrieval-Augmented Generation (RAG) systems by injecting adversarially crafted, yet highly fluent…
The paper introduces 'covert control attacks,' a novel and stealthy data poisoning method that teaches LLMs an information hiding scheme, allowing malicious instructions to be encoded and decoded and…
Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan +1 more
The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, trigger-activated malicious behaviors by manipulating the observable Chain-of-Thought (CoT) process in Large Language Model…
Zhe Yu, Wenpeng Xing, Gaolei Li, Shuguang Xiong +3 more
The paper introduces CORDON-MAS, a compartmentalized framework that defends Retrieval-Augmented Generation (RAG) against knowledge poisoning by enforcing strict information-flow control, significantly…
Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo +4 more
SPARD is a defense framework that uses Safety-Projected Alternating optimization and Relevance-Diversity data selection to protect large language models from harmful fine-tuning attacks, achieving sup…
Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo +4 more
SPARD is a defense framework that uses Safety-Projected Alternating optimization and Relevance-Diversity data selection to mitigate harmful fine-tuning attacks that undermine LLM safety.
Karima Makhlouf, Lamiaa Basyoni, Syed Khaderi, Gabriel Marquez +3 more
This paper conducts a structured ablation study using a unified threat model to evaluate how various system factors (like model architecture and retrieval configuration) influence different types of p…
This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…