~ similar to 2604.02372v1· 20 results
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma +1 more
The paper introduces MetaBackdoor, a novel class of LLM backdoor attacks that exploits positional encoding (length-based triggers) rather than requiring modifications to the textual content.
Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao +5 more
The paper introduces BadDLM, a unified framework that demonstrates a new class of backdoor vulnerabilities in Diffusion Language Models (DLMs) by exploiting their forward masking process across divers…
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang +2 more
The paper introduces BadStyle, a novel backdoor attack framework that generates natural, stealthy poisoned samples using LLMs to compromise various LLMs with high success rates and robust activation.
The paper proposes a novel cross-modal backdoor attack that exploits the vulnerability of lightweight connectors in multimodal LLMs, demonstrating high attack success rates across different modalities…
Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more
Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.
The paper introduces Rotated Robustness (RoR), a training-free defense that uses orthogonal transformations to prevent catastrophic model collapse in LLMs caused by hardware bit-flip attacks.
Wenhan Chang, Tianqing Zhu, Ping Xiong, Faqian Guan +1 more
The paper proposes Two-stage Backdoor Hijacking (TSBH) to create persistent, trigger-activated malicious behaviors by manipulating the observable Chain-of-Thought (CoT) process in Large Language Model…
The paper compares two sparse autoencoder architectures, finding that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in isolating backdoor-related features in language models.
Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more
The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
The paper introduces Critical-CoT, a novel two-stage fine-tuning defense framework that equips LLMs with critical thinking abilities to detect and reject malicious reasoning steps introduced by advanc…
Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng +5 more
SkillTrojan introduces a novel backdoor attack targeting the composition of reusable skills in agent systems, demonstrating high attack success rates with minimal impact on normal system functionality…
The paper introduces SHADOWMASK, the first systematic backdoor attack targeting Masked Diffusion Language Models (MDLMs), demonstrating near-100% attack success while preserving clean model utility.
Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath +1 more
The paper introduces the threat model of sequential data poisoning, demonstrating that multiple, collaborating attackers can exploit compound vulnerabilities in LLM post-training pipelines that are in…
ProtoGuard-SL introduces a server-side defense that enhances vertical split learning robustness against backdoor attacks by enforcing class-conditional consistency in the embedding space.
The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…
Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma +1 more
The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while main…
This paper introduces a fingerprinting method that exploits subtle numerical deviations in the inference system components (like the engine or hardware) to reliably identify the specific components us…
BackFlush introduces a novel, knowledge-free framework that detects and eliminates unknown backdoor attacks in LLMs while simultaneously preserving existing watermarks, achieving high detection rates…