~ similar to 2605.26574v1· 20 results
BackFlush introduces a novel, knowledge-free framework that detects and eliminates unknown backdoor attacks in LLMs while simultaneously preserving existing watermarks, achieving high detection rates…
The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…
This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…
This paper introduces the first backdoor attack specifically targeting pipeline parallelism in decentralized post-training, demonstrating that a limited adversary controlling an intermediate stage can…
Yuchen Shi, Xin Guo, Huajie Chen, Tianqing Zhu +2 more
The paper proposes Cluster Segregation Concealment (CSC), a novel defense that identifies and neutralizes backdoor triggers by relabeling poisoned samples to a virtual class, achieving near-zero attac…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng +5 more
SkillTrojan introduces a novel backdoor attack targeting the composition of reusable skills in agent systems, demonstrating high attack success rates with minimal impact on normal system functionality…
The paper introduces a novel, transferable learned attack (LT-MIA) that detects a universal 'signature of memorization' in language models, achieving high accuracy across diverse model architectures (…
Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao +5 more
The paper introduces BadDLM, a unified framework that demonstrates a new class of backdoor vulnerabilities in Diffusion Language Models (DLMs) by exploiting their forward masking process across divers…
The paper constructs a large, adversarial malware dataset from real-world binaries, demonstrating high evasion rates and showing that even small amounts of poisoned data can severely compromise malwar…
This paper proposes a density-aware attack that constructs triggers by placing poisoned samples in low-density regions of the clean data distribution, achieving high attack success rates even after st…
The paper introduces TorchSight, an open-source local system using a fine-tuned Qwen 3.5 27B model that achieves high accuracy (95.0%) in classifying sensitive security documents without relying on ex…
Quang Duc Nguyen, Siyuan Liang, Yiming Li, Fushuo Huo +1 more
The paper proposes TimeGuard, a novel channel-wise pool training defense, to significantly improve the robustness of time series forecasting against backdoor attacks by addressing signal dilution and…
Yinbo Yu, Xueyu Yin, Jing Fang, Chunwei Tian +3 more
The paper proposes HTell, a fast and lightweight data-free backdoor detector that analyzes the abnormal response concentration of backdoored models on the target class using random latent probes appli…
This paper proposes SABLE, a method for generating semantically meaningful and in-distribution backdoor triggers for federated learning, demonstrating that such attacks remain a potent and practical t…
The paper introduces a hybrid CNN-CodeBERT framework for three-class credential leakage detection, significantly improving accuracy by explicitly distinguishing genuine secrets from non-secret placeho…
The paper proposes a novel hybrid CNN-CodeBERT framework for three-class credential leakage detection, significantly improving accuracy by explicitly distinguishing genuine secrets from weak or placeh…
The paper compares two sparse autoencoder architectures, finding that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in isolating backdoor-related features in language models.