~ similar to 2604.21700v1· 20 results
Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao +5 more
The paper introduces BadDLM, a unified framework that demonstrates a new class of backdoor vulnerabilities in Diffusion Language Models (DLMs) by exploiting their forward masking process across divers…
Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma +1 more
The paper introduces MetaBackdoor, a novel class of LLM backdoor attacks that exploits positional encoding (length-based triggers) rather than requiring modifications to the textual content.
Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan +2 more
The paper introduces Poison-with-Style (PwS), a stealthy model poisoning attack that exploits developers' inherent code styles as covert triggers to make Code LLMs generate vulnerable code without exp…
Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more
The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…
The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…
The paper proposes a novel cross-modal backdoor attack that exploits the vulnerability of lightweight connectors in multimodal LLMs, demonstrating high attack success rates across different modalities…
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…
The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…
This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…
Karima Makhlouf, Lamiaa Basyoni, Syed Khaderi, Gabriel Marquez +3 more
This paper conducts a structured ablation study using a unified threat model to evaluate how various system factors (like model architecture and retrieval configuration) influence different types of p…
Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more
Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.
The paper introduces 'covert control attacks,' a novel and stealthy data poisoning method that teaches LLMs an information hiding scheme, allowing malicious instructions to be encoded and decoded and…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
Yifei Wang, Tianlin Li, Xiaohan Zhang, Yida Yang +2 more
This paper introduces a novel class of backdoor attacks that exploit the numerical side effects of LLM inference optimization, achieving high success rates while maintaining clean accuracy.
The paper compares two sparse autoencoder architectures, finding that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in isolating backdoor-related features in language models.
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
The paper empirically evaluates the security quality of LLM-generated code across various prompting methods, finding that while prompting alters the structure of weaknesses, it is insufficient to reli…